<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>How to run Intel® Optimized MP LINPACK Benchmark on KNL platform? in Intel® oneAPI Math Kernel Library</title>
    <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/How-to-run-Intel-Optimized-MP-LINPACK-Benchmark-on-KNL-platform/m-p/1099223#M23767</link>
    <description>&lt;P style="word-wrap: break-word; font-size: 12px;"&gt;Dear all,&lt;/P&gt;

&lt;P style="word-wrap: break-word; font-size: 12px;"&gt;I have a problem with the results of the MKL MP LINPACK benchmark. My system has 24 compute nodes, each with an Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz, a Xeon Phi Q7200, and 256GB RAM. When I run ./runme_intel64 on each node individually, the performance is good, around 700-900 GFlops (Xeon CPU only).&lt;/P&gt;

&lt;P style="word-wrap: break-word; font-size: 12px;"&gt;But when I run HPL on 4 nodes, 8 nodes, or more, the result is very bad; sometimes it returns no result at all, failing with the error: MPI TERMINATED,... Afterwards, when I run the test (runme_intel64) on each node again, the performance is very low:&lt;/P&gt;

&lt;P style="word-wrap: break-word; font-size: 12px;"&gt;~ 11,243 GFlops,&lt;/P&gt;

&lt;P style="word-wrap: break-word; font-size: 12px;"&gt;~ 10,845 GFlops,&lt;/P&gt;

&lt;P style="word-wrap: break-word; font-size: 12px;"&gt;....&lt;/P&gt;

&lt;P style="word-wrap: break-word; font-size: 12px;"&gt;I don't know the reason why. My guess is that the cluster's power supply is not sufficient for the whole system, and that the HPE BIOS is set to Balanced Mode for the cluster (it automatically switches to a lower power state when the system cannot draw enough power). But even when I run on only a few nodes with the power configuration set to maximum, the problem is still not solved.&lt;/P&gt;

&lt;P style="word-wrap: break-word; font-size: 12px;"&gt;Please help me with this problem, thank you all!&lt;/P&gt;</description>
    <pubDate>Tue, 21 Feb 2017 15:06:17 GMT</pubDate>
    <dc:creator>MChun4</dc:creator>
    <dc:date>2017-02-21T15:06:17Z</dc:date>
    <item>
      <title>How to run Intel® Optimized MP LINPACK Benchmark on KNL platform?</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/How-to-run-Intel-Optimized-MP-LINPACK-Benchmark-on-KNL-platform/m-p/1099221#M23765</link>
      <description>&lt;P&gt;My KNL platform is based on an Intel(R) Xeon Phi(TM) CPU 7210 @ 1.30GHz: 1 node, 64 cores, 64GB memory. I have some problems with the LINPACK benchmark.&lt;/P&gt;

&lt;P&gt;Before using the Intel® Optimized MP LINPACK Benchmark for Clusters, I used HPL 2.2 and the Intel Optimized MP LINPACK Benchmark. With both, the results were bad: the best was 486 Gflops with HPL 2.2 and 683.6404 Gflops with the Intel Optimized MP LINPACK Benchmark. However, the theoretical peak performance is 1*64*1.3*32 = 2662.4 Gflops.&lt;/P&gt;

&lt;P&gt;So I am confused. It looks as if AVX-512 is not being used. Where am I going wrong?&lt;/P&gt;

&lt;P&gt;In the HPL 2.2 test, I set N=82800, NB=336, P=4, Q=16 and ran "mpiexec -n 64 ./xhpl"; that gave my best HPL 2.2 result (486 Gflops). I also tested N=82800, NB=336, P=8, Q=32 with "mpiexec -n 256 ./xhpl", but because there is not enough memory, the result is low.&lt;/P&gt;

&lt;P&gt;I am now trying to use the Intel® Optimized MP LINPACK Benchmark for Clusters, but I am having trouble running it. If I run a small test, such as&lt;/P&gt;

&lt;PRE class="brush:bash;" style="font-size: 13.008px;"&gt;mpiexec -np 8 ./xhpl -n 10000 -b 336 -p 2 -q 4&lt;/PRE&gt;

&lt;P&gt;I can get a result.&lt;/P&gt;

&lt;P&gt;Even if I enlarge N and NB, such as&lt;/P&gt;

&lt;PRE class="brush:bash;" style="font-size: 13.008px;"&gt;mpiexec -np 32 ./xhpl -n 83000 -b 336 -p 4 -q 8&lt;/PRE&gt;

&lt;P&gt;I can get a result too.&lt;/P&gt;

&lt;P&gt;But when I set P*Q=64 or more, a problem occurs:&lt;/P&gt;

&lt;PRE class="brush:bash;"&gt;[root@knl mp_linpack]$ mpiexec -np 64 ./xhpl -n 83000 -b 336 -p 4 -q 16
Number of Intel(R) Xeon Phi(TM) coprocessors : 0
Rank 0: First 5 column_factors=1 1 1 1 1
HPL[knl] pthread_create Error in HPL_pdupdate.
&lt;/PRE&gt;

&lt;P&gt;The test exits immediately.&lt;/P&gt;

&lt;P&gt;So what should I do to get a higher LINPACK result?&lt;/P&gt;

&lt;P&gt;Thanks!&lt;/P&gt;</description>
      <pubDate>Mon, 20 Feb 2017 07:41:34 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/How-to-run-Intel-Optimized-MP-LINPACK-Benchmark-on-KNL-platform/m-p/1099221#M23765</guid>
      <dc:creator>danquxunhuan</dc:creator>
      <dc:date>2017-02-20T07:41:34Z</dc:date>
    </item>
    <item>
      <title>Could you try running the</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/How-to-run-Intel-Optimized-MP-LINPACK-Benchmark-on-KNL-platform/m-p/1099222#M23766</link>
      <description>&lt;P&gt;Could you try running the benchmark without the mpiexec? On a single node, we do not need to use multiple MPI processes to get the best performance. You could try something like this:&lt;/P&gt;

&lt;P&gt;&lt;FONT face="Courier New"&gt;./xhpl -n 83000 -b 336 &lt;/FONT&gt;&lt;/P&gt;

&lt;P&gt;Then, when you go to multi-node, please use 1 MPI process per node for the KNL systems.&lt;/P&gt;
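
&lt;P&gt;For example (just a sketch; the hostnames, node count, and problem size below are placeholders, not tested values), a 4-node run with 1 MPI process per node could look like this with Intel MPI:&lt;/P&gt;

&lt;P&gt;&lt;FONT face="Courier New"&gt;mpiexec -ppn 1 -n 4 -hosts knl1,knl2,knl3,knl4 ./xhpl -n 166000 -b 336 -p 2 -q 2&lt;/FONT&gt;&lt;/P&gt;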

&lt;P&gt;Thank you.&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 21 Feb 2017 08:47:52 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/How-to-run-Intel-Optimized-MP-LINPACK-Benchmark-on-KNL-platform/m-p/1099222#M23766</guid>
      <dc:creator>Murat_G_Intel</dc:creator>
      <dc:date>2017-02-21T08:47:52Z</dc:date>
    </item>
    <item>
      <title>Dear all,</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/How-to-run-Intel-Optimized-MP-LINPACK-Benchmark-on-KNL-platform/m-p/1099223#M23767</link>
      <description>&lt;P style="word-wrap: break-word; font-size: 12px;"&gt;Dear all,&lt;/P&gt;

&lt;P style="word-wrap: break-word; font-size: 12px;"&gt;I have a problem with the results of the MKL MP LINPACK benchmark. My system has 24 compute nodes, each with an Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz, a Xeon Phi Q7200, and 256GB RAM. When I run ./runme_intel64 on each node individually, the performance is good, around 700-900 GFlops (Xeon CPU only).&lt;/P&gt;

&lt;P style="word-wrap: break-word; font-size: 12px;"&gt;But when I run HPL on 4 nodes, 8 nodes, or more, the result is very bad; sometimes it returns no result at all, failing with the error: MPI TERMINATED,... Afterwards, when I run the test (runme_intel64) on each node again, the performance is very low:&lt;/P&gt;

&lt;P style="word-wrap: break-word; font-size: 12px;"&gt;~ 11,243 GFlops,&lt;/P&gt;

&lt;P style="word-wrap: break-word; font-size: 12px;"&gt;~ 10,845 GFlops,&lt;/P&gt;

&lt;P style="word-wrap: break-word; font-size: 12px;"&gt;....&lt;/P&gt;

&lt;P style="word-wrap: break-word; font-size: 12px;"&gt;I don't know the reason why. My guess is that the cluster's power supply is not sufficient for the whole system, and that the HPE BIOS is set to Balanced Mode for the cluster (it automatically switches to a lower power state when the system cannot draw enough power). But even when I run on only a few nodes with the power configuration set to maximum, the problem is still not solved.&lt;/P&gt;

&lt;P style="word-wrap: break-word; font-size: 12px;"&gt;Please help me with this problem, thank you all!&lt;/P&gt;</description>
      <pubDate>Tue, 21 Feb 2017 15:06:17 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/How-to-run-Intel-Optimized-MP-LINPACK-Benchmark-on-KNL-platform/m-p/1099223#M23767</guid>
      <dc:creator>MChun4</dc:creator>
      <dc:date>2017-02-21T15:06:17Z</dc:date>
    </item>
    <item>
      <title>Thank you for your answer.</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/How-to-run-Intel-Optimized-MP-LINPACK-Benchmark-on-KNL-platform/m-p/1099224#M23768</link>
      <description>&lt;P&gt;Thank you for your answer.&lt;/P&gt;

&lt;P&gt;I followed your advice and tried the following (in the mp_linpack working directory):&lt;/P&gt;

&lt;PRE class="brush:bash;"&gt;./xhpl -n 83000 -b 336&lt;/PRE&gt;

&lt;P&gt;The result I got was 716.506 Gflops.&lt;/P&gt;

&lt;P&gt;This is the best result I have ever had, but the theoretical peak performance of a single Intel(R) Xeon Phi(TM) CPU 7210 node is 2662.4 Gflops. There is still a big gap between this result and the theoretical performance.&lt;/P&gt;

&lt;P&gt;While the LINPACK test was running, I watched it with monitoring software and found that CPU utilization was only about 25%. I learned from some material that when running HPL on this platform I can use all 256 threads, but the "top" command shows the "xhpl" process using 6400% CPU. I also tried setting the environment with&lt;/P&gt;

&lt;PRE class="brush:bash;"&gt;export OMP_NUM_THREADS=256
export MKL_NUM_THREADS=256&lt;/PRE&gt;

&lt;P&gt;in order to change the number of threads, but it did not help; the result did not change.&lt;/P&gt;

&lt;P&gt;I remember that when I use "mpiexec -np 64 ./xhpl" in HPL 2.2, the program creates 64 processes, each using 100% CPU. When I use "mpiexec -np 256 ./xhpl" in HPL 2.2, it creates 256 processes, each using 100% CPU. But neither achieves the ideal result.&lt;/P&gt;

&lt;P&gt;It seems that this process only used 64 threads. How can I use all of the threads? Or what should I do to get a higher LINPACK result?&lt;/P&gt;

&lt;P&gt;And another question: did using multiple MPI processes cause the error below?&lt;/P&gt;

&lt;PRE class="brush:bash;" style="color: rgb(0, 0, 0);"&gt;[root@knl mp_linpack]$ mpiexec -np 64 ./xhpl -n 83000 -b 336 -p 4 -q 16
Number of Intel(R) Xeon Phi(TM) coprocessors : 0
Rank 0: First 5 column_factors=1 1 1 1 1
HPL[knl] pthread_create Error in HPL_pdupdate.&lt;/PRE&gt;

&lt;P&gt;Thanks.&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 1em;"&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;BLOCKQUOTE&gt;Murat Efe Guney (Intel) wrote:&lt;BR /&gt;&lt;P&gt;&lt;/P&gt;

&lt;P&gt;Could you try running the benchmark without the mpiexec? On a single node, we do not need to use multiple MPI processes to get the best performance. You could try something like this:&lt;/P&gt;

&lt;P&gt;./xhpl -n 83000 -b 336&lt;/P&gt;

&lt;P&gt;Then, when you go to multi-node, please use 1 MPI process per node for the KNL systems.&lt;/P&gt;

&lt;P&gt;Thank you.&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 22 Feb 2017 12:47:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/How-to-run-Intel-Optimized-MP-LINPACK-Benchmark-on-KNL-platform/m-p/1099224#M23768</guid>
      <dc:creator>danquxunhuan</dc:creator>
      <dc:date>2017-02-22T12:47:00Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt;... I learned from a</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/How-to-run-Intel-Optimized-MP-LINPACK-Benchmark-on-KNL-platform/m-p/1099225#M23769</link>
      <description>&amp;gt;&amp;gt;... I learned from some material that when running HPL on this platform, I can use &lt;STRONG&gt;all 256 threads&lt;/STRONG&gt;...

Top performance on your &lt;STRONG&gt;KNL&lt;/STRONG&gt; system will come when only &lt;STRONG&gt;64 cores&lt;/STRONG&gt; and &lt;STRONG&gt;64 OpenMP threads&lt;/STRONG&gt; are used ( spread across all cores ). That is,
...
export OMP_NUM_THREADS=64
export MKL_NUM_THREADS=64
...
need to be executed instead.

Also, try to set &lt;STRONG&gt;KMP_AFFINITY&lt;/STRONG&gt; environment variable to:
...
export KMP_AFFINITY=scatter
or
export KMP_AFFINITY=scatter,verbose
...
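
For example ( just a sketch; N and NB below are taken from your earlier single-node run, not tuned recommendations ), a complete single-node sequence would be:
...
export OMP_NUM_THREADS=64
export MKL_NUM_THREADS=64
export KMP_AFFINITY=scatter
./xhpl -n 83000 -b 336
...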

With &lt;STRONG&gt;KMP_AFFINITY&lt;/STRONG&gt; set to &lt;STRONG&gt;compact&lt;/STRONG&gt; or &lt;STRONG&gt;balanced&lt;/STRONG&gt; mode, performance could be worse than in &lt;STRONG&gt;scatter&lt;/STRONG&gt; mode. I recommend testing all of these modes.</description>
      <pubDate>Thu, 23 Feb 2017 20:42:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/How-to-run-Intel-Optimized-MP-LINPACK-Benchmark-on-KNL-platform/m-p/1099225#M23769</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2017-02-23T20:42:00Z</dc:date>
    </item>
    <item>
      <title>Thank you for your answer.</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/How-to-run-Intel-Optimized-MP-LINPACK-Benchmark-on-KNL-platform/m-p/1099226#M23770</link>
      <description>&lt;P&gt;&lt;SPAN style="font-size: 12px;"&gt;Thank you for your answer.&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;In a training course I learned that "KNL supports 4 threads per core; in other words, this 7210 KNL can run up to 256 MPI threads," and that "the hardware setting is currently the best for the HPL test."&lt;/P&gt;

&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="21223.jpg"&gt;&lt;img src="https://community.intel.com/t5/image/serverpage/image-id/9399iB4B670F2F3EBA0D9/image-size/large?v=v2&amp;amp;px=999&amp;amp;whitelist-exif-data=Orientation%2CResolution%2COriginalDefaultFinalSize%2CCopyright" role="button" title="21223.jpg" alt="21223.jpg" /&gt;&lt;/span&gt;&lt;/P&gt;

&lt;P&gt;So I said I can use all 256 threads.&amp;nbsp;&lt;/P&gt;

&lt;P&gt;And I tried to set&amp;nbsp;&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 13.008px;"&gt;export OMP_NUM_THREADS=64&lt;/SPAN&gt;&lt;BR style="font-size: 13.008px;" /&gt;
	&lt;SPAN style="font-size: 13.008px;"&gt;export MKL_NUM_THREADS=64&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 13.008px;"&gt;and run (mp_linpack)&amp;nbsp;&lt;/SPAN&gt;Intel® Optimized MP LINPACK Benchmark for Clusters, the&amp;nbsp;&lt;SPAN style="font-size: 13.008px;"&gt;result is still bad.&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;I try to&amp;nbsp;&lt;SPAN style="font-size: 13.008px;"&gt;google&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN style="font-weight: 700; font-size: 13.008px;"&gt;KMP_AFFINITY&lt;/SPAN&gt;&lt;SPAN style="font-size: 13.008px;"&gt;&amp;nbsp;&lt;/SPAN&gt;and I find this is a environment variables for OPENMP not MPI. &amp;nbsp;The software&amp;nbsp;environment is Intel composer, Intel MPI and Intel MKL.&lt;/P&gt;

&lt;P&gt;I also try it. But it seems still not works.( The results have some fluctuate, but there are&amp;nbsp;&lt;SPAN style="font-size: 12px;"&gt;still a big gap between this results and the theoretical&amp;nbsp;performance.)&lt;/SPAN&gt;&lt;/P&gt;


&lt;P&gt;I have some questions.&lt;/P&gt;

&lt;P&gt;1. As you say,&lt;/P&gt;

&lt;P&gt;"A top performance on your &lt;STRONG&gt;KNL&lt;/STRONG&gt; system will be when only &lt;STRONG&gt;64 cores&lt;/STRONG&gt; and &lt;STRONG&gt;64 OpenMP threads&lt;/STRONG&gt; are used ( spread across all cores )"&lt;/P&gt;

&lt;P&gt;So how should I run mp_linpack (the Intel® Optimized MP LINPACK Benchmark for Clusters, &lt;A href="https://software.intel.com/en-us/node/528619" target="_blank"&gt;https://software.intel.com/en-us/node/528619&lt;/A&gt;)? "./xhpl -n 83000 -b 336", or "mpiexec -np 64 ./xhpl -n 83000 -b 336 -p 4 -q 16", or something else?&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 13.008px;"&gt;2.&amp;nbsp;I also tested the HPL2.2 and&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN style="font-size: 13.008px;"&gt;Intel® Optimized LINPACK Benchmark for Linux* (&lt;/SPAN&gt;&lt;SPAN style="color: rgb(102, 102, 102); font-family: Arial, Tahoma, Helvetica, sans-serif; font-size: 13px;"&gt;which runs on a single platform,&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN style="font-size: 13.008px;"&gt;&lt;A href="https://software.intel.com/en-us/node/528615)" target="_blank"&gt;https://software.intel.com/en-us/node/528615)&lt;/A&gt;, but the result is still not good. &lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;For HPL 2.2, do you know how it should be run?&lt;/P&gt;

&lt;P&gt;"mpiexec -np 64 ./xhpl" and in HPL.dat, N=83000 Nb=336 P=4 Q=16&lt;/P&gt;

&lt;P&gt;or&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;SPAN style="font-size: 13.008px;"&gt;"mpiexec -np 256 ./xhpl" and in HPL.dat, N=83000 Nb=336 P=8 Q=32&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 13.008px;"&gt;or other?&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 13.008px;"&gt;And in the&amp;nbsp;Intel® Optimized LINPACK Benchmark for Linux*&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN style="color: rgb(34, 34, 34); font-family: Consolas, &amp;quot;Lucida Console&amp;quot;, &amp;quot;Courier New&amp;quot;, monospace; font-size: 12px; white-space: pre-wrap;"&gt;Developer Guide&lt;/SPAN&gt;&lt;SPAN style="font-size: 13.008px;"&gt;&amp;nbsp;, there are only b&lt;/SPAN&gt;&lt;SPAN style="font-size: 13.008px;"&gt;rief introduction, no input files&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN style="font-size: 13.008px;"&gt;introduction. How to test it?&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 12px;"&gt;&amp;nbsp;3.The theoretical performance to a single&amp;nbsp;Intel(R) Xeon Phi(TM) CPU 7210 node is 2662.4Gflops.&amp;nbsp;&lt;/SPAN&gt;&amp;nbsp;But the top result I get is&amp;nbsp;&lt;SPAN style="font-size: 12px;"&gt;716.506Gflops. This confused me much. How to get close to the&amp;nbsp;theoretical performance?&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 13.008px;"&gt;Thanks.&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;/P&gt;&lt;BLOCKQUOTE&gt;Sergey Kostrov wrote:&lt;BR /&gt;&lt;P&gt;&lt;/P&gt;

&lt;P&gt;&amp;gt;&amp;gt;... I learned from a material that When I running HPL on this platform, I can use &lt;STRONG&gt;all 256 threads&lt;/STRONG&gt;...&lt;/P&gt;

&lt;P&gt;A top performance on your &lt;STRONG&gt;KNL&lt;/STRONG&gt; system will be when only &lt;STRONG&gt;64 cores&lt;/STRONG&gt; and &lt;STRONG&gt;64 OpenMP threads&lt;/STRONG&gt; are used ( spread across all cores ). That is,&lt;BR /&gt;
	...&lt;BR /&gt;
	export OMP_NUM_THREADS=64&lt;BR /&gt;
	export MKL_NUM_THREADS=64&lt;BR /&gt;
	...&lt;BR /&gt;
	need to be executed instead.&lt;/P&gt;

&lt;P&gt;Also, try to set &lt;STRONG&gt;KMP_AFFINITY&lt;/STRONG&gt; environment variable to:&lt;BR /&gt;
	...&lt;BR /&gt;
	export KMP_AFFINITY=scatter&lt;BR /&gt;
	or&lt;BR /&gt;
	export KMP_AFFINITY=scatter,verbose&lt;BR /&gt;
	...&lt;/P&gt;

&lt;P&gt;With &lt;STRONG&gt;KMP_AFFINITY&lt;/STRONG&gt; set to &lt;STRONG&gt;compact&lt;/STRONG&gt; or &lt;STRONG&gt;balanced&lt;/STRONG&gt; modes performance could be worse when compared to &lt;STRONG&gt;scatter&lt;/STRONG&gt; mode. I recommend you to test all of these modes.&lt;/P&gt;

&lt;P&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 27 Feb 2017 03:55:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/How-to-run-Intel-Optimized-MP-LINPACK-Benchmark-on-KNL-platform/m-p/1099226#M23770</guid>
      <dc:creator>danquxunhuan</dc:creator>
      <dc:date>2017-02-27T03:55:00Z</dc:date>
    </item>
    <item>
      <title>I've used the official</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/How-to-run-Intel-Optimized-MP-LINPACK-Benchmark-on-KNL-platform/m-p/1099227#M23771</link>
      <description>&lt;P&gt;I used the official version of hpl-2.2 on a dual-node Phi 7230 HPC cluster 3 months ago and reached 3888.69 GFlops (2123 GFlops on a single node), so I think I have some knowledge of configuring and optimizing hpl-2.2 on the KNL platform. I'm sorry to see you got bad performance (486 GFlops), but I suspect it was caused by poor compile-time optimization. Show me your specific Make.intel64 configuration file. Maybe I can help you... who knows?&lt;/P&gt;</description>
      <pubDate>Tue, 28 Feb 2017 13:49:55 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/How-to-run-Intel-Optimized-MP-LINPACK-Benchmark-on-KNL-platform/m-p/1099227#M23771</guid>
      <dc:creator>Duo_S_</dc:creator>
      <dc:date>2017-02-28T13:49:55Z</dc:date>
    </item>
    <item>
      <title>Thank you for your answer.</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/How-to-run-Intel-Optimized-MP-LINPACK-Benchmark-on-KNL-platform/m-p/1099228#M23772</link>
      <description>&lt;P&gt;&lt;SPAN style="font-size: 12px; line-height: 18px;"&gt;Thank you for your answer.&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 12px; line-height: 18px;"&gt;My KNL platform is based on Intel(R) Xeon Phi(TM) CPU 7210 @ 1.30GHz, 1 node, 64 cores and 64GB memory.(may add the extra 16G eDRAM memory in KNL?)&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 12px; line-height: 18px;"&gt;My software&amp;nbsp;environment is Intel composer, Intel MPI and Intel MKL.&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 12px; line-height: 18px;"&gt;My top result tested by HPL2.2 is 486Gflops, with&amp;nbsp;N=82800, NB=336, P=4 ,Q=16 in HPL.dat.&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 1em; line-height: 1.5;"&gt;Here is my Make.intel64 file:&lt;/SPAN&gt;&lt;/P&gt;

&lt;PRE class="brush:bash;"&gt;#  
#  -- High Performance Computing Linpack Benchmark (HPL)                
#     HPL - 2.2 - February 24, 2016                          
#     Antoine P. Petitet                                                
#     University of Tennessee, Knoxville                                
#     Innovative Computing Laboratory                                 
#     (C) Copyright 2000-2008 All Rights Reserved                       
#                                                                       
#  -- Copyright notice and Licensing terms:                             
#                                                                       
#  Redistribution  and  use in  source and binary forms, with or without
#  modification, are  permitted provided  that the following  conditions
#  are met:                                                             
#                                                                       
#  1. Redistributions  of  source  code  must retain the above copyright
#  notice, this list of conditions and the following disclaimer.        
#                                                                       
#  2. Redistributions in binary form must reproduce  the above copyright
#  notice, this list of conditions,  and the following disclaimer in the
#  documentation and/or other materials provided with the distribution. 
#                                                                       
#  3. All  advertising  materials  mentioning  features  or  use of this
#  software must display the following acknowledgement:                 
#  This  product  includes  software  developed  at  the  University  of
#  Tennessee, Knoxville, Innovative Computing Laboratory.             
#                                                                       
#  4. The name of the  University,  the name of the  Laboratory,  or the
#  names  of  its  contributors  may  not  be used to endorse or promote
#  products  derived   from   this  software  without  specific  written
#  permission.                                                          
#                                                                       
#  -- Disclaimer:                                                       
#                                                                       
#  THIS  SOFTWARE  IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
#  ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES,  INCLUDING,  BUT NOT
#  LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
#  A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE UNIVERSITY
#  OR  CONTRIBUTORS  BE  LIABLE FOR ANY  DIRECT,  INDIRECT,  INCIDENTAL,
#  SPECIAL,  EXEMPLARY,  OR  CONSEQUENTIAL DAMAGES  (INCLUDING,  BUT NOT
#  LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
#  DATA OR PROFITS; OR BUSINESS INTERRUPTION)  HOWEVER CAUSED AND ON ANY
#  THEORY OF LIABILITY, WHETHER IN CONTRACT,  STRICT LIABILITY,  OR TORT
#  (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
#  OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
# ######################################################################
#  
# ----------------------------------------------------------------------
# - shell --------------------------------------------------------------
# ----------------------------------------------------------------------
#
SHELL        = /bin/sh
#
CD           = cd
CP           = cp
LN_S         = ln -fs
MKDIR        = mkdir -p
RM           = /bin/rm -f
TOUCH        = touch
#
# ----------------------------------------------------------------------
# - Platform identifier ------------------------------------------------
# ----------------------------------------------------------------------
#
ARCH         = Linux_Intel64
#
# ----------------------------------------------------------------------
# - HPL Directory Structure / HPL library ------------------------------
# ----------------------------------------------------------------------
#
TOPdir       = /home/user002/benchmark/hpl-2.2
INCdir       = $(TOPdir)/include
BINdir       = $(TOPdir)/bin/$(ARCH)
LIBdir       = $(TOPdir)/lib/$(ARCH)
#
HPLlib       = $(LIBdir)/libhpl.a
#
# ----------------------------------------------------------------------
# - Message Passing library (MPI) --------------------------------------
# ----------------------------------------------------------------------
# MPinc tells the  C  compiler where to find the Message Passing library
# header files,  MPlib  is defined  to be the name of  the library to be
# used. The variable MPdir is only used for defining MPinc and MPlib.
#
MPdir        = /opt/intel/compilers_and_libraries_2017.1.132/linux/mpi
MPinc        = -I$(MPdir)/include64
MPlib        = $(MPdir)/lib64/libmpi.a
#
# ----------------------------------------------------------------------
# - Linear Algebra library (BLAS or VSIPL) -----------------------------
# ----------------------------------------------------------------------
# LAinc tells the  C  compiler where to find the Linear Algebra  library
# header files,  LAlib  is defined  to be the name of  the library to be
# used. The variable LAdir is only used for defining LAinc and LAlib.
#
LAdir        = /opt/intel/compilers_and_libraries_2017.1.132/linux/mkl
ifndef  LAinc
LAinc        = $(LAdir)/include
endif
ifndef  LAlib
LAlib        = -L$(LAdir)/lib/intel64 \
               -Wl,--start-group \
               $(LAdir)/lib/intel64/libmkl_intel_lp64.a \
               $(LAdir)/lib/intel64/libmkl_intel_thread.a \
               $(LAdir)/lib/intel64/libmkl_core.a \
               -Wl,--end-group -lpthread -ldl
endif
#
# ----------------------------------------------------------------------
# - F77 / C interface --------------------------------------------------
# ----------------------------------------------------------------------
# You can skip this section  if and only if  you are not planning to use
# a  BLAS  library featuring a Fortran 77 interface.  Otherwise,  it  is
# necessary  to  fill out the  F2CDEFS  variable  with  the  appropriate
# options.  **One and only one**  option should be chosen in **each** of
# the 3 following categories:
#
# 1) name space (How C calls a Fortran 77 routine)
#
# -DAdd_              : all lower case and a suffixed underscore  (Suns,
#                       Intel, ...),                           [default]
# -DNoChange          : all lower case (IBM RS6000),
# -DUpCase            : all upper case (Cray),
# -DAdd__             : the FORTRAN compiler in use is f2c.
#
# 2) C and Fortran 77 integer mapping
#
# -DF77_INTEGER=int   : Fortran 77 INTEGER is a C int,         [default]
# -DF77_INTEGER=long  : Fortran 77 INTEGER is a C long,
# -DF77_INTEGER=short : Fortran 77 INTEGER is a C short.
#
# 3) Fortran 77 string handling
#
# -DStringSunStyle    : The string address is passed at the string loca-
#                       tion on the stack, and the string length is then
#                       passed as  an  F77_INTEGER  after  all  explicit
#                       stack arguments,                       [default]
# -DStringStructPtr   : The address  of  a  structure  is  passed  by  a
#                       Fortran 77  string,  and the structure is of the
#                       form: struct {char *cp; F77_INTEGER len;},
# -DStringStructVal   : A structure is passed by value for each  Fortran
#                       77 string,  and  the  structure is  of the form:
#                       struct {char *cp; F77_INTEGER len;},
# -DStringCrayStyle   : Special option for  Cray  machines,  which  uses
#                       Cray  fcd  (fortran  character  descriptor)  for
#                       interoperation.
#
F2CDEFS      = -DAdd__ -DF77_INTEGER=int -DStringSunStyle
#
# ----------------------------------------------------------------------
# - HPL includes / libraries / specifics -------------------------------
# ----------------------------------------------------------------------
#
HPL_INCLUDES = -I$(INCdir) -I$(INCdir)/$(ARCH) -I$(LAinc) $(MPinc)
HPL_LIBS     = $(HPLlib) $(LAlib) $(MPlib)
#
# - Compile time options -----------------------------------------------
#
# -DHPL_COPY_L           force the copy of the panel L before bcast;
# -DHPL_CALL_CBLAS       call the cblas interface;
# -DHPL_CALL_VSIPL       call the vsip  library;
# -DHPL_DETAILED_TIMING  enable detailed timers;
#
# By default HPL will:
#    *) not copy L before broadcast,
#    *) call the BLAS Fortran 77 interface,
#    *) not display detailed timing information.
#
#HPL_OPTS     = -DHPL_DETAILED_TIMING -DHPL_PROGRESS_REPORT
HPL_OPTS     = -DASYOUGO -DHYBRID
#
# ----------------------------------------------------------------------
#
HPL_DEFS     = $(F2CDEFS) $(HPL_OPTS) $(HPL_INCLUDES)
#
# ----------------------------------------------------------------------
# - Compilers / linkers - Optimization flags ---------------------------
# ----------------------------------------------------------------------
#
CC       = mpiicc
CCNOOPT  = $(HPL_DEFS) -O0 -w -nocompchk
#OMP_DEFS = -openmp
#CCFLAGS  = $(HPL_DEFS) -O3 -w -z noexecstack -z relro -z now -nocompchk -Wall
CCFLAGS = $(HPL_DEFS) -O3  -w -ansi-alias -i-static -z noexecstack -z relro -z now -openmp -nocompchk
#
# On some platforms,  it is necessary  to use the Fortran linker to find
# the Fortran internals used in the BLAS library.
#
LINKER       = $(CC)
#LINKFLAGS    = $(CCFLAGS) $(OMP_DEFS) -mt_mpi
LINKFLAGS    = $(CCFLAGS) -openmp -mt_mpi $(STATICFLAG) -nocompchk
#
ARCHIVER     = ar
ARFLAGS      = r
RANLIB       = echo
#
# ----------------------------------------------------------------------
&lt;/PRE&gt;
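
For context, an arch file like the one above is consumed by the stock HPL build system; a typical build-and-run sequence (a sketch — the `intel64` arch name and paths are assumptions based on the file name in this thread) looks like:

```shell
# Assumes Make.intel64 sits in the HPL top-level directory
make arch=intel64
cd bin/intel64
# The MPI process count should match P*Q in HPL.dat
mpirun -np 64 ./xhpl
```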

&lt;P&gt;I'm hoping for your answer. You're very kind. Thanks!&lt;/P&gt;

&lt;P&gt;&lt;/P&gt;&lt;BLOCKQUOTE&gt;doctor_duo_sim wrote:&lt;BR /&gt;&lt;P&gt;&lt;/P&gt;

&lt;P&gt;I used the official version of hpl-2.2 on a dual-node Phi 7230 HPC system 3 months ago, and I reached 3888.69 GFlops (single node: 2123 GFlops), so I think I have some knowledge of configuring and optimizing hpl-2.2 on the KNL platform. I'm sorry to see you got bad performance (486 GFlops), but I suspect it was caused by poor compile optimization. Show me your specific configuration of the Make.intel64 file. Maybe I can help you...who knows?&lt;/P&gt;

&lt;P&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 28 Feb 2017 14:10:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/How-to-run-Intel-Optimized-MP-LINPACK-Benchmark-on-KNL-platform/m-p/1099228#M23772</guid>
      <dc:creator>danquxunhuan</dc:creator>
      <dc:date>2017-02-28T14:10:00Z</dc:date>
    </item>
    <item>
      <title>...I've checked but it seems</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/How-to-run-Intel-Optimized-MP-LINPACK-Benchmark-on-KNL-platform/m-p/1099229#M23773</link>
      <description>&lt;P&gt;...I've checked but&amp;nbsp;it&amp;nbsp;seems nothing wrong about your Make.intel64 file, I'm afraid that I can't figure out why you get &amp;nbsp;so poor score . you can refer to this&amp;nbsp;&lt;A href="https://software.intel.com/en-us/articles/performance-tools-for-software-developers-hpl-application-note"&gt;page&lt;/A&gt;, it may help you, as for the 16GB HBM, you should restart the sever and check your bios to make sure you have set it as cache to get best performance.&lt;/P&gt;</description>
      <pubDate>Wed, 01 Mar 2017 06:43:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/How-to-run-Intel-Optimized-MP-LINPACK-Benchmark-on-KNL-platform/m-p/1099229#M23773</guid>
      <dc:creator>Duo_S_</dc:creator>
      <dc:date>2017-03-01T06:43:00Z</dc:date>
    </item>
    <item>
      <title>Thanks for your answer.</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/How-to-run-Intel-Optimized-MP-LINPACK-Benchmark-on-KNL-platform/m-p/1099230#M23774</link>
      <description>&lt;P&gt;Thanks for your answer.&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 13.008px;"&gt;You're very kind!&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;I will refer to this page and rerun the HPL test.&lt;/P&gt;

&lt;P&gt;And I hope you can give me some more help with running HPL.&lt;/P&gt;

&lt;P&gt;Can you show me your HPL.dat and HPL.out contents with your best result? And your running command, such as "mpirun -np 64 ./xhpl"? Or any other run settings, such as environment variables?&lt;/P&gt;

&lt;P&gt;I think this may help me solve the problem.&lt;/P&gt;

&lt;P&gt;Thanks!&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;&lt;/P&gt;&lt;BLOCKQUOTE&gt;Duo S. wrote:&lt;BR /&gt;&lt;P&gt;&lt;/P&gt;

&lt;P&gt;...I've checked, but it seems there is nothing wrong with your Make.intel64 file; I'm afraid I can't figure out why you get such a poor score. You can refer to this &lt;A href="https://software.intel.com/en-us/articles/performance-tools-for-software-developers-hpl-application-note"&gt;page&lt;/A&gt;; it may help you. As for the 16GB HBM, you should restart the server and check your BIOS to make sure you have set it to cache mode to get the best performance.&lt;/P&gt;

&lt;P&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 01 Mar 2017 12:12:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/How-to-run-Intel-Optimized-MP-LINPACK-Benchmark-on-KNL-platform/m-p/1099230#M23774</guid>
      <dc:creator>danquxunhuan</dc:creator>
      <dc:date>2017-03-01T12:12:00Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt;...My top result tested by</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/How-to-run-Intel-Optimized-MP-LINPACK-Benchmark-on-KNL-platform/m-p/1099231#M23775</link>
      <description>&amp;gt;&amp;gt;...My top result tested by HPL2.2 is 486Gflops, with N=82800, &lt;STRONG&gt;NB=336&lt;/STRONG&gt;, P=4 ,Q=16 in HPL.dat.

&lt;STRONG&gt;1&lt;/STRONG&gt;. You're using an option &lt;STRONG&gt;NB=336&lt;/STRONG&gt; and this is a recommended default value for a &lt;STRONG&gt;KNL&lt;/STRONG&gt; system with &lt;STRONG&gt;72&lt;/STRONG&gt; cores. Could you try a value &lt;STRONG&gt;NB=256&lt;/STRONG&gt; instead?
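For reference, the block size and process grid live in HPL.dat; a minimal sketch of the relevant lines with the suggested NB=256 (all values illustrative, taken from the numbers quoted in this thread, not tuned for any specific machine) might look like:

```
1            # of problems sizes (N)
82800        Ns
1            # of NBs
256          NBs
1            # of process grids (P x Q)
4            Ps
16           Qs
```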

&lt;STRONG&gt;2&lt;/STRONG&gt;. I've executed a &lt;STRONG&gt;micprun&lt;/STRONG&gt; benchmark test; take a look at my report, attached. Here are some numbers:
...
[ DESCRIPTION ] 7680 x 7680 MKL DGEMM with 0 threads and 3 iterations
[ PERFORMANCE ] Task.Computation.Avg &lt;STRONG&gt;1874.40&lt;/STRONG&gt; GFlops R
...
[ DESCRIPTION ] hpcg Local Dimensions nx=160, ny=160, nz=160, MPI ranks 4, threads per rank 32
[ PERFORMANCE ] Computation.Avg &lt;STRONG&gt;42.2244&lt;/STRONG&gt; GFlops R
...
[ DESCRIPTION ] HPLinpack problem size 100000 block size 336
[ PERFORMANCE ] Computation.Avg &lt;STRONG&gt;1709.2&lt;/STRONG&gt; GFlops R
...
Performance Summary (GFlops)

Size   LDA    Align.  Average  Maximal
40960  41024  4       &lt;STRONG&gt;1682.3623&lt;/STRONG&gt; &lt;STRONG&gt;1684.5587&lt;/STRONG&gt;
...
[ DESCRIPTION ] 40960 x 40960 MKL DP LINPACK with 64 threads and 3 iterations
[ PERFORMANCE ] Computation.Avg &lt;STRONG&gt;1682.3623&lt;/STRONG&gt; GFlops R
...
testing XGEMM( 'N', 'N', n, n, ... )

          n        min        avg        max     stddev
      15872    3944.35    3950.08    3960.58  7.421e+00
*     15872    3944.35    3950.08    3960.58  7.421e+00

[ DESCRIPTION ] 15872 x 15872 MKL SGEMM with 0 threads and 3 iterations
[ PERFORMANCE ] Task.Computation.Avg &lt;STRONG&gt;3950.08&lt;/STRONG&gt; GFlops R
...</description>
      <pubDate>Wed, 01 Mar 2017 17:14:20 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/How-to-run-Intel-Optimized-MP-LINPACK-Benchmark-on-KNL-platform/m-p/1099231#M23775</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2017-03-01T17:14:20Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt;...The highest result is</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/How-to-run-Intel-Optimized-MP-LINPACK-Benchmark-on-KNL-platform/m-p/1099232#M23776</link>
      <description>&amp;gt;&amp;gt;...
&amp;gt;&amp;gt;The highest result is 486 Gflops when I use HPL 2.2 and 683.6404 Gflops when I use Intel Optimized MP LINPACK Benchmark.
&amp;gt;&amp;gt;However, the theoretical peak performance is &lt;STRONG&gt;1*64*1.3*32=2662.4 Gflops&lt;/STRONG&gt;...
&amp;gt;&amp;gt;...

I'm not surprised that real &lt;STRONG&gt;GFlops&lt;/STRONG&gt; numbers for a &lt;STRONG&gt;KNL&lt;/STRONG&gt; system are lower ( nothing is wrong with that! ), and it could be due to many reasons. That simple calculation, I mean &lt;STRONG&gt;1*64*1.3*32=2662.4 Gflops&lt;/STRONG&gt;, doesn't take into account performance overheads from internal sources ( some time is spent executing non-FPU instructions ) and external sources ( different OS services, etc., running at the same time ).
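That back-of-the-envelope peak can be written out explicitly (a sketch; the function name is a hypothetical helper, and the terms are the ones quoted in this thread: sockets &amp;times; cores &amp;times; frequency in GHz &amp;times; DP FLOPs per cycle):

```python
def theoretical_peak_gflops(sockets, cores, ghz, flops_per_cycle):
    """Theoretical DP peak in GFlops: sockets * cores * frequency (GHz) * FLOPs/cycle."""
    return sockets * cores * ghz * flops_per_cycle

# KNL node from the thread: 1 socket, 64 cores, 1.3 GHz, 32 DP FLOPs/cycle
peak = theoretical_peak_gflops(1, 64, 1.3, 32)
print(peak)  # 2662.4
```

Measured HPL numbers will always sit below this figure, since the formula ignores non-FMA instructions, frequency throttling, and OS overhead.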

Here is a set of results for a &lt;STRONG&gt;KNL&lt;/STRONG&gt; server for a matrix multiplication using &lt;STRONG&gt;MKL&lt;/STRONG&gt;'s &lt;STRONG&gt;sgemm&lt;/STRONG&gt; function:

&lt;STRONG&gt;[ 16384 x 16384 ]&lt;/STRONG&gt; Peak: &lt;STRONG&gt;1442.51&lt;/STRONG&gt; GFlops

&lt;STRONG&gt;[ 32768 x 32768 ]&lt;/STRONG&gt; Peak: &lt;STRONG&gt;1455.22&lt;/STRONG&gt; GFlops

&lt;STRONG&gt;[ 65536 x 65536 ]&lt;/STRONG&gt; Peak: &lt;STRONG&gt;1477.93&lt;/STRONG&gt; GFlops

&lt;STRONG&gt;[ 81920 x 81920 ]&lt;/STRONG&gt; Peak: &lt;STRONG&gt;1347.65&lt;/STRONG&gt; GFlops

&lt;STRONG&gt;[ 98304 x 98304 ]&lt;/STRONG&gt; Peak: &lt;STRONG&gt;1287.24&lt;/STRONG&gt; GFlops

&lt;STRONG&gt;[ 114688 x 114688 ]&lt;/STRONG&gt; Peak: &lt;STRONG&gt;1332.90&lt;/STRONG&gt; GFlops

&lt;STRONG&gt;Tests completed on&lt;/STRONG&gt;:

Intel Xeon Phi Processor 7210 ( 16GB, 1.30 GHz, 64 core )
Processor name     : Intel(R) Xeon Phi(TM) 7210
Packages (sockets) : 1
Cores              : 64
Processors (CPUs)  : 256
Cores per package  : 64
Threads per core   : 4

RAM: 96GB
MCDRAM: 16 GB

Cluster mode: All2All
MCDRAM mode: Flat

Environment variables: KMP_AFFINITY=scatter

Operating system: Red Hat Enterprise Linux  3.10.0-327.13.1</description>
      <pubDate>Wed, 01 Mar 2017 20:25:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/How-to-run-Intel-Optimized-MP-LINPACK-Benchmark-on-KNL-platform/m-p/1099232#M23776</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2017-03-01T20:25:00Z</dc:date>
    </item>
    <item>
      <title>Duo S wrote:</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/How-to-run-Intel-Optimized-MP-LINPACK-Benchmark-on-KNL-platform/m-p/1099233#M23777</link>
      <description>&lt;STRONG&gt;Duo S wrote&lt;/STRONG&gt;:

&amp;gt;...
&amp;gt;&amp;gt;...and I reached &lt;STRONG&gt;3888.69&lt;/STRONG&gt; GFlops
&amp;gt;...

It looks too high, and I recommend completing an MKL-based verification. You know that Intel's MKL API is highly optimized to get peak performance out of a system.</description>
      <pubDate>Wed, 01 Mar 2017 20:45:06 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/How-to-run-Intel-Optimized-MP-LINPACK-Benchmark-on-KNL-platform/m-p/1099233#M23777</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2017-03-01T20:45:06Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt;That simple calculation, I</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/How-to-run-Intel-Optimized-MP-LINPACK-Benchmark-on-KNL-platform/m-p/1099234#M23778</link>
      <description>&amp;gt;&amp;gt;That simple calculation, I mean 1*64*1.3*32=2662.4 Gflops, doesn't take into account performance overheads from internal ( some
&amp;gt;&amp;gt;time is spent to execute Non FPU instructions ) and &lt;STRONG&gt;external sources ( different OS services, etc, running at the same time )&lt;/STRONG&gt;.

On the &lt;STRONG&gt;Windows&lt;/STRONG&gt; operating system I've done some performance evaluation in &lt;STRONG&gt;Safe Mode&lt;/STRONG&gt;, when the number of running &lt;STRONG&gt;OS&lt;/STRONG&gt; services is minimal ( fewer than 10 ).</description>
      <pubDate>Wed, 01 Mar 2017 20:52:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/How-to-run-Intel-Optimized-MP-LINPACK-Benchmark-on-KNL-platform/m-p/1099234#M23778</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2017-03-01T20:52:00Z</dc:date>
    </item>
    <item>
      <title>Hi Sergey,</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/How-to-run-Intel-Optimized-MP-LINPACK-Benchmark-on-KNL-platform/m-p/1099235#M23779</link>
      <description>&lt;P&gt;Hi Sergey,&lt;/P&gt;

&lt;P&gt;I am on the same platform as yours,&amp;nbsp;&lt;SPAN style="font-size: 12px;"&gt;Intel Xeon Phi Processor 7210 (16GB, 1.30 GHz, 64 cores). What I want to observe is the thread-count performance impact, starting from 1 thread mapped to 1 core (rest turned off), then 2 threads mapped to 2 different cores (rest turned off), and so on, up to 256 threads mapped to 64 different cores. For the initial analysis I can do without mapping threads to cores, but I want a specific number of threads based on the number of active cores.&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 12px;"&gt;For such a test, which benchmark would you suggest, and what should I be aware of? I tried DeepBench, but I need to figure out how to make use of threading in it.&lt;/SPAN&gt;&lt;/P&gt;
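One way to sweep thread counts while controlling how many cores are active is Intel OpenMP's KMP_HW_SUBSET together with KMP_AFFINITY (a sketch; the `./benchmark` binary name is a placeholder for whichever benchmark you choose):

```shell
# Restrict the run to 2 cores with 1 thread each (hypothetical ./benchmark binary)
export KMP_HW_SUBSET=2c,1t
export OMP_NUM_THREADS=2
export KMP_AFFINITY=compact,verbose   # verbose prints the thread-to-core bindings
./benchmark
```

Repeating this with `4c,1t`, `8c,1t`, ... up to `64c,4t` gives the sweep described above without rebooting to disable cores.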

&lt;P&gt;&lt;SPAN style="font-size: 12px;"&gt;Thanks.&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 29 Aug 2017 06:05:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/How-to-run-Intel-Optimized-MP-LINPACK-Benchmark-on-KNL-platform/m-p/1099235#M23779</guid>
      <dc:creator>CPati2</dc:creator>
      <dc:date>2017-08-29T06:05:00Z</dc:date>
    </item>
    <item>
      <title>Estimating "peak" performance</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/How-to-run-Intel-Optimized-MP-LINPACK-Benchmark-on-KNL-platform/m-p/1099236#M23780</link>
      <description>&lt;P&gt;Estimating "peak" performance on KNL is a bit tricky...&amp;nbsp;&amp;nbsp; For my Xeon Phi 7250 processors (68-core, 1.4 GHz nominal), the guaranteed frequency running AVX-512-heavy code is 1.2 GHz.&amp;nbsp; The Xeon Phi core is also a 2-instruction-issue core, but peak performance requires 2 FMA's per cycle -- so any instruction that is not an FMA is a direct subtraction from the maximum available performance.&amp;nbsp;&amp;nbsp; It is difficult to be precise, but it is hard to imagine any encoding of the *DGEMM kernel that does not contain about 20% non-FMA instructions.&lt;/P&gt;

&lt;P&gt;So a ballpark double-precision "adjusted peak" for the Xeon Phi 7250 is&lt;/P&gt;

&lt;BLOCKQUOTE&gt;
	&lt;P&gt;68 cores * 1.2 GHz * 32 DP FP Ops/Hz * 80% FMA density = 2089 GFLOPS&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
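The derating above can be expressed directly (a sketch; the function name is a hypothetical helper, and the ~80% FMA density is the estimate from this post, not a measured value):

```python
def adjusted_peak_gflops(cores, ghz, dp_flops_per_cycle, fma_density):
    """'Adjusted' DP peak: derate the raw peak by the fraction of FMA instructions."""
    return cores * ghz * dp_flops_per_cycle * fma_density

# Xeon Phi 7250 numbers from the post: 68 cores, 1.2 GHz AVX-512 base,
# 32 DP FLOPs/cycle, ~80% FMA density
print(round(adjusted_peak_gflops(68, 1.2, 32, 0.80)))  # 2089
```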

&lt;P&gt;For DGEMM problems that can fit all three arrays into MCDRAM (in flat-quadrant mode), I have seen performance of just over 2000 GFLOPS. I don't understand why, but these runs maintain an average frequency that is significantly higher than 1.2 GHz -- close to 1.4 GHz.&amp;nbsp;&amp;nbsp; The observed performance is ~85% of the "adjusted peak" performance at the observed frequency, which seems pretty reasonable.&lt;/P&gt;

&lt;P&gt;HPL execution is dominated by DGEMM, but the overall algorithm is much more complex.&amp;nbsp; Unlike DGEMM, when I run HPL on KNL I do see frequencies close to the expected power-limited 1.2 GHz.&amp;nbsp;&amp;nbsp; Also unlike DGEMM, when I run HPL I find that the KNL does not reach asymptotic performance for problem sizes that fit into the MCDRAM memory.&amp;nbsp; To get asymptotic performance for larger problems, you need to either run with the MCDRAM in cached mode, or you need an implementation that explicitly stages the data (in large blocks) through MCDRAM.&amp;nbsp;&amp;nbsp; If I recall correctly, asymptotic HPL performance on KNL requires array sizes of at least 50-60 GiB.&amp;nbsp; On clusters, even larger sizes (per KNL) are needed to minimize overhead due to inter-node MPI communication.&lt;/P&gt;
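The 50-60 GiB figure can be sanity-checked against the problem size quoted earlier in the thread, since the HPL matrix is N &amp;times; N doubles (a sketch; `hpl_matrix_gib` is a hypothetical helper):

```python
def hpl_matrix_gib(n):
    """Memory footprint of the N x N double-precision HPL matrix, in GiB."""
    return n * n * 8 / 2**30

# N = 82800, the problem size quoted earlier in the thread:
print(round(hpl_matrix_gib(82800), 1))  # 51.1
```

On a flat-mode node, staging a run in MCDRAM is usually done externally, e.g. `numactl --membind=1 ./xhpl`; whether MCDRAM appears as NUMA node 1 depends on the cluster mode, so check `numactl -H` first.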

&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 31 Aug 2017 14:21:33 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/How-to-run-Intel-Optimized-MP-LINPACK-Benchmark-on-KNL-platform/m-p/1099236#M23780</guid>
      <dc:creator>McCalpinJohn</dc:creator>
      <dc:date>2017-08-31T14:21:33Z</dc:date>
    </item>
    <item>
      <title>Quote:McCalpin, John</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/How-to-run-Intel-Optimized-MP-LINPACK-Benchmark-on-KNL-platform/m-p/1099237#M23781</link>
      <description>&lt;P&gt;&lt;/P&gt;&lt;BLOCKQUOTE&gt;McCalpin, John (Blackbelt) wrote:&lt;BR /&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Estimating "peak" performance on KNL is a bit tricky...&amp;nbsp;&amp;nbsp; For my Xeon Phi 7250 processors (68-core, 1.4 GHz nominal), the guaranteed frequency running AVX-512-heavy code is 1.2 GHz.&amp;nbsp; The Xeon Phi core is also a 2-instruction-issue core, but peak performance requires 2 FMA's per cycle -- so any instruction that is not an FMA is a direct subtraction from the maximum available performance.&amp;nbsp;&amp;nbsp; It is difficult to be precise, but it is hard to imagine any encoding of the *DGEMM kernel that does not contain about 20% non-FMA instructions.&lt;/P&gt;&lt;P&gt;So a ballpark double-precision "adjusted peak" for the Xeon Phi 7250 is&lt;/P&gt;&lt;BLOCKQUOTE&gt;&lt;P&gt;68 cores * 1.2 GHz * 32 DP FP Ops/Hz * 80% FMA density = 2089 GFLOPS&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;For DGEMM problems that can fit all three arrays into MCDRAM (in flat-quadrant mode), I have seen performance of just over 2000 GFLOPS. 
I don't understand why, but these runs maintain an average frequency that is significantly higher than 1.2 GHz -- close to 1.4 GHz.&amp;nbsp;&amp;nbsp; The observed performance is ~85% of the "adjusted peak" performance at the observed frequency, which seems pretty reasonable.&lt;/P&gt;&lt;P&gt;HPL execution is dominated by DGEMM, but the overall algorithm is much more complex.&amp;nbsp; Unlike DGEMM, when I run HPL on KNL I do see frequencies close to the expected power-limited 1.2 GHz.&amp;nbsp;&amp;nbsp; Also unlike DGEMM, when I run HPL I find that the KNL does not reach asymptotic performance for problem sizes that fit into the MCDRAM memory.&amp;nbsp; &lt;STRONG&gt;To get asymptotic performance for larger problems, you need to either run with the MCDRAM in cached mode, or you need an implementation that explicitly stages the data (in large blocks) through MCDRAM.&amp;nbsp;&amp;nbsp; If I recall correctly, asymptotic HPL performance on KNL requires array sizes of at least 50-60 GiB.&lt;/STRONG&gt;&amp;nbsp; On clusters, even larger sizes (per KNL) are needed to minimize overhead due to inter-node MPI communication.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Dear&amp;nbsp;McCalpin, John (Blackbelt),&lt;/P&gt;&lt;P&gt;Your information is really helpful. My architecture is also Xeon Phi 7250 (68-core, 1.4 GHz), but the performance I got for HPL is just 804 GFlops on 68 cores.&lt;/P&gt;&lt;P&gt;Could you explain in more detail your guidance on using MCDRAM? That is, do I have to set environment variables, or do I need to modify the source code to use MCDRAM memory?&lt;/P&gt;&lt;P&gt;Or could you guide me in tuning some parameters in HPL.dat to get good performance?&lt;/P&gt;&lt;P&gt;I hope to hear from you soon.&lt;/P&gt;&lt;P&gt;Thanks a lot.&lt;/P&gt;</description>
      <pubDate>Sat, 31 Aug 2019 08:38:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/How-to-run-Intel-Optimized-MP-LINPACK-Benchmark-on-KNL-platform/m-p/1099237#M23781</guid>
      <dc:creator>Tuyen__Nguyen</dc:creator>
      <dc:date>2019-08-31T08:38:00Z</dc:date>
    </item>
  </channel>
</rss>

