<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>LINPACK with multiple MPI ranks in Intel® oneAPI Math Kernel Library</title>
    <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/LINPACK-with-multiple-MPI-ranks/m-p/1176237#M28962</link>
    <description>LINPACK with multiple MPI ranks</description>
    <pubDate>Tue, 15 May 2018 03:42:12 GMT</pubDate>
    <dc:creator>Ying_H_Intel</dc:creator>
    <dc:date>2018-05-15T03:42:12Z</dc:date>
    <item>
      <title>LINPACK with multiple MPI ranks</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/LINPACK-with-multiple-MPI-ranks/m-p/1176233#M28958</link>
      <description>Hello,

to benchmark our new Skylake cluster, which consists of two- and four-socket machines together with a Broadwell system, I want to be able to run LINPACK with a different number of MPI ranks per node. My problem is that too many processes are spawned on the four-socket nodes where I launch two MPI ranks.
I tried to limit the number of threads via OMP_NUM_THREADS and MKL_NUM_THREADS, but without effect. TBB seems to be the cause here, because some MKL functions (which are probably used in LINPACK) are threaded with it:
&lt;A href="https://software.intel.com/en-us/mkl-macos-developer-guide-functions-threaded-with-intel-threading-building-blocks" target="_blank"&gt;https://software.intel.com/en-us/mkl-macos-developer-guide-functions-threaded-with-intel-threading-building-blocks&lt;/A&gt;
As far as I know, there is no way to influence the number of threads TBB creates via environment variables.

So my question is: how do I run LINPACK with two MPI ranks on one node (and still get full performance)?

Best regards,
Holger</description>
      <pubDate>Mon, 30 Apr 2018 08:33:16 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/LINPACK-with-multiple-MPI-ranks/m-p/1176233#M28958</guid>
      <dc:creator>Holger_A_</dc:creator>
      <dc:date>2018-04-30T08:33:16Z</dc:date>
    </item>
    <item>
      <title> </title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/LINPACK-with-multiple-MPI-ranks/m-p/1176234#M28959</link>
      <description>&lt;P style="margin: 0px;"&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;Hi Holger,&lt;BR /&gt;
	Could you please tell us how you are running LINPACK and which binary you are using?&lt;BR /&gt;
	I noticed the documentation you refer to is for macOS, and you mentioned that OpenMP doesn't work but TBB seems to be running. Are you working on macOS?&lt;/P&gt;

&lt;P&gt;MKL has released three benchmarks:&lt;/P&gt;

&lt;UL&gt;
	&lt;LI&gt;1) &lt;A href="https://software.intel.com/node/98add226-0e9a-4ccf-ae71-063160bb2e60"&gt;&lt;U&gt;&lt;FONT color="#0066cc"&gt;Intel® Optimized LINPACK Benchmark for Linux*&lt;/FONT&gt;&lt;/U&gt;&lt;/A&gt;&lt;/LI&gt;
	&lt;LI class="ulchildlink"&gt;2) &lt;A href="https://software.intel.com/node/ec8b9c0d-360d-4256-9804-046a12f5bc14"&gt;&lt;U&gt;&lt;FONT color="#0066cc"&gt;Intel® Distribution for LINPACK* Benchmark&lt;/FONT&gt;&lt;/U&gt;&lt;/A&gt;&lt;/LI&gt;
	&lt;LI class="ulchildlink"&gt;3) &lt;A href="https://software.intel.com/node/68b42282-90be-40a5-8ed0-fca9da76806e"&gt;&lt;U&gt;&lt;FONT color="#0066cc"&gt;Intel® Optimized High Performance Conjugate Gradient Benchmark&lt;/FONT&gt;&lt;/U&gt;&lt;/A&gt;&lt;/LI&gt;
&lt;/UL&gt;

&lt;P&gt;Not sure which one you are running.&lt;/P&gt;

&lt;P&gt;But if you are working on Linux and using No. 2), the Intel Distribution for LINPACK Benchmark, then you can use HPL_HOST_CORE (see &lt;A href="https://software.intel.com/en-us/mkl-linux-developer-guide-environment-variables" target="_blank"&gt;https://software.intel.com/en-us/mkl-linux-developer-guide-environment-variables&lt;/A&gt;) to control the number of threads and which cores are used.&lt;/P&gt;
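
&lt;P&gt;For illustration only (the core numbers below assume a hypothetical node with two 16-core sockets; HPL_HOST_CORE takes a comma-separated list of cores or core ranges), splitting such a node between two ranks might look like:&lt;/P&gt;

&lt;P&gt;mpirun -env HPL_HOST_CORE=0-15 -np 1 ./xhpl_intel64_dynamic : -env HPL_HOST_CORE=16-31 -np 1 ./xhpl_intel64_dynamic&lt;/P&gt;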

&lt;P&gt;And if you are working on macOS, use No. 1).&lt;/P&gt;

&lt;P&gt;As the documentation notes: &lt;A href="https://software.intel.com/en-us/mkl-macos-developer-guide-known-limitations-of-the-intel-optimized-linpack-benchmark" target="_blank"&gt;https://software.intel.com/en-us/mkl-macos-developer-guide-known-limitations-of-the-intel-optimized-linpack-benchmark&lt;/A&gt;&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;Known Limitations of the Intel® Optimized LINPACK Benchmark&lt;/P&gt;

&lt;DIV class="field field-name-body field-type-text-with-summary field-label-hidden"&gt;
	&lt;DIV class="field-items"&gt;
		&lt;DIV class="field-item even" property="content:encoded"&gt;
			&lt;DIV class="topic-wrapper" id="8E89A03F-EE8D-4A0E-9640-CEB8AA6F300F"&gt;
				&lt;DIV id="GUID-09AD7A07-919E-494B-AED3-501B6091328D"&gt;
					&lt;P id="GUID-0DEE359B-95B0-41F5-AD2E-3B38C212E2FD"&gt;The following limitations are known for the Intel Optimized LINPACK Benchmark for macOS*:&lt;/P&gt;

					&lt;UL id="GUID-CC0A8538-0164-414D-ABA4-CB7190440D82"&gt;
						&lt;LI id="GUID-261EA51B-60B1-4925-8C6C-6C499785FAE2"&gt;Intel Optimized LINPACK Benchmark supports only OpenMP threading&lt;/LI&gt;
						&lt;LI id="GUID-AE3170CD-7198-41F0-B048-9AEE6FE3FEE1"&gt;If an incomplete data input file is given, the binaries may either hang or fault. See the sample data input files and/or the extended help for insight into creating a correct data input file.&lt;/LI&gt;
						&lt;LI id="GUID-8182261E-3F2B-4B91-A1EC-1DB67799584A"&gt;The binary will hang if it is not given an input file or any other arguments.&lt;/LI&gt;
					&lt;/UL&gt;
				&lt;/DIV&gt;
			&lt;/DIV&gt;
		&lt;/DIV&gt;
	&lt;/DIV&gt;
&lt;/DIV&gt;

&lt;P&gt;So setting the OpenMP threading controls should work. You can use export KMP_AFFINITY=verbose to see how many OpenMP threads are actually spawned.&lt;/P&gt;
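
&lt;P&gt;For illustration only (the rank count and binary name below are placeholders matching the rest of this thread), the check could look like:&lt;/P&gt;

&lt;P&gt;export KMP_AFFINITY=verbose&lt;BR /&gt;
	mpirun -np 2 ./xhpl_intel64_dynamic&lt;/P&gt;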
&lt;P&gt;Best Regards,&lt;BR /&gt;
	Ying&lt;/P&gt;</description>
      <pubDate>Wed, 02 May 2018 02:33:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/LINPACK-with-multiple-MPI-ranks/m-p/1176234#M28959</guid>
      <dc:creator>Ying_H_Intel</dc:creator>
      <dc:date>2018-05-02T02:33:00Z</dc:date>
    </item>
    <item>
      <title>Hi Ying,</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/LINPACK-with-multiple-MPI-ranks/m-p/1176235#M28960</link>
      <description>Hi Ying,

thank you for your reply. In fact, using HPL_HOST_CORE and HPL_HOST_NODE, respectively, is helping me.
I am using Linpack from MKL under Linux, so most probably option 2).
What I do at the moment is the following (see output_linpack1.txt in the attachment):
mpirun -machinefile $M_NAME -host r05n01 -env HPL_HOST_NODE=0,1 -np 1 /home/holger/.local/easybuild/software/imkl/2018.1.163-iimpi-2018a/mkl/benchmarks/mp_linpack/xhpl_intel64_dynamic : -host r05n01 -env HPL_HOST_NODE=2,3 -np 1 /home/holger/.local/easybuild/software/imkl/2018.1.163-iimpi-2018a/mkl/benchmarks/mp_linpack/xhpl_intel64_dynamic
This delivers about 5.8 TFLOPs at the beginning of Linpack.

When I run (output_linpack2.txt)
mpirun -machinefile $M_NAME -np 2 /home/holger/.local/easybuild/software/imkl/2018.1.163-iimpi-2018a/mkl/benchmarks/mp_linpack/xhpl_intel64_dynamic
on the same machine, the output is the same regarding the thread placement, I think, but I only get 2 TFLOPs.

As a test, I ran (output_linpack3.txt)
export I_MPI_PIN_DOMAIN=omp
export OMP_NUM_THREADS=1
export MKL_NUM_THREADS=1
mpirun -machinefile $M_NAME -np $NUM_PROCS /home/holger/.local/easybuild/software/imkl/2018.1.163-iimpi-2018a/mkl/benchmarks/mp_linpack/xhpl_intel64_dynamic
Now it claims to start one thread per MPI process, but in fact (according to top) it also starts 36 threads per process. I don't understand why so many threads are spawned here. Therefore I think that in the second example there are also more threads per process, and both MPI ranks are trying to use the whole machine.
Unfortunately, to use this four-socket machine together with our two-socket nodes, I would like to be able to start two MPI processes on it.

Best regards,
Holger</description>
      <pubDate>Wed, 02 May 2018 11:27:36 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/LINPACK-with-multiple-MPI-ranks/m-p/1176235#M28960</guid>
      <dc:creator>Holger_A_</dc:creator>
      <dc:date>2018-05-02T11:27:36Z</dc:date>
    </item>
    <item>
      <title> </title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/LINPACK-with-multiple-MPI-ranks/m-p/1176236#M28961</link>
      <description>&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;Hi Holger,&lt;BR /&gt;
	&lt;BR /&gt;
	What is your exact test machine? Is it a 4-socket Broadwell system, i.e. 18 cores x 2 HT x 4 = 144 threads? What does your top output look like?&lt;/P&gt;

&lt;P&gt;There is some discussion in&lt;BR /&gt;
	&lt;A href="https://software.intel.com/en-us/forums/intel-math-kernel-library/topic/605789" target="_blank"&gt;https://software.intel.com/en-us/forums/intel-math-kernel-library/topic/605789&lt;/A&gt;&lt;BR /&gt;
	&lt;BR /&gt;
	Some hints:&lt;/P&gt;

&lt;P&gt;1. MP_LINPACK does not use OpenMP threads, so you cannot use the OpenMP thread count to control the MKL threads.&lt;/P&gt;

&lt;P&gt;2. From the output, Case 2 and Case 1 should be the same, but Case 1 has clearer affinity. Looking at CPU usage, Case 1 uses less CPU but gives better performance.&lt;/P&gt;

&lt;P&gt;3. Case 3, which pins each MPI rank to one core, has almost the same performance as Case 2.&lt;/P&gt;

&lt;P&gt;The performance difference may be caused by different memory usage, etc.&lt;BR /&gt;
	4. According to our experience:&lt;/P&gt;

&lt;P&gt;MPI_PROC_NUM = the number of actual physical servers, which equals P x Q.&lt;/P&gt;

&lt;P&gt;MPI_PER_NODE = this should be 1, whether you have a single-socket or dual-socket system. If you set it to 2 on a dual-socket system, htop will show about 40% memory usage while in fact about 80% is being used, and there will be two controlling threads instead of one.&lt;/P&gt;

&lt;P&gt;So for your 1 node with 4 sockets, it should be enough to simply run: mpirun -np 1 ./xhpl_intel64_dynamic -p 1 -q 1&lt;/P&gt;

&lt;P&gt;(I get almost the same result if I use Case 1.)&lt;/P&gt;

&lt;P&gt;T/V&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; N&amp;nbsp;&amp;nbsp;&amp;nbsp; NB&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; P&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; Q&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; Time&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; Gflops&lt;BR /&gt;
	--------------------------------------------------------------------------------&lt;BR /&gt;
	WC00C2R2&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 10000&amp;nbsp;&amp;nbsp; 384&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 1&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 1&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 2.70&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;STRONG&gt; 2.46543e+02&lt;/STRONG&gt;&lt;/P&gt;

&lt;P&gt;&lt;STRONG&gt;And for your 2 nodes, the 2-socket Skylake and the 4-socket Broadwell, it may be the same as your affinity in Case 1.&lt;/STRONG&gt;&lt;/P&gt;

&lt;P&gt;Here is a result which comes close to reproducing your problem; I tried it on my 2-socket Skylake system:&lt;BR /&gt;
	&lt;BR /&gt;
	root@dell-r640:/opt/intel/compilers_and_libraries_2018.1.163/linux/mkl/benchmarks/mp_linpack# numactl --hardware&lt;BR /&gt;
	available: 2 nodes (0-1)&lt;BR /&gt;
	node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30&lt;BR /&gt;
	node 0 size: 15730 MB&lt;BR /&gt;
	node 0 free: 14356 MB&lt;BR /&gt;
	node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31&lt;BR /&gt;
	node 1 size: 16081 MB&lt;BR /&gt;
	node 1 free: 14894 MB&lt;BR /&gt;
	node distances:&lt;BR /&gt;
	node&amp;nbsp;&amp;nbsp; 0&amp;nbsp;&amp;nbsp; 1&lt;BR /&gt;
	&amp;nbsp; 0:&amp;nbsp; 10&amp;nbsp; 21&lt;BR /&gt;
	&amp;nbsp; 1:&amp;nbsp; 21&amp;nbsp; 10&lt;/P&gt;

&lt;P&gt;&lt;STRONG&gt;Your Case 2: &lt;/STRONG&gt;&lt;/P&gt;

&lt;P&gt;s/mp_linpack# mpirun -np 2 ./xhpl_intel64_dynamic&lt;BR /&gt;
	[0] MPI startup(): Multi-threaded optimized library&lt;BR /&gt;
	[0] MPI startup(): shm data transfer mode&lt;BR /&gt;
	[1] MPI startup(): shm data transfer mode&lt;BR /&gt;
	[0] MPI startup(): Rank&amp;nbsp;&amp;nbsp;&amp;nbsp; Pid&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; Node name&amp;nbsp; Pin cpu&lt;BR /&gt;
	[0] MPI startup(): 0&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 217575&amp;nbsp;&amp;nbsp; dell-r640&amp;nbsp; {0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30}&lt;BR /&gt;
	[0] MPI startup(): 1&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 217576&amp;nbsp;&amp;nbsp; dell-r640&amp;nbsp; {1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31}&lt;BR /&gt;
	...&lt;BR /&gt;
	dell-r640&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; : Column=009984 Fraction=0.995 Kernel= 4954.61 Mflops=205557.14&lt;BR /&gt;
	================================================================================&lt;BR /&gt;
	T/V&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; N&amp;nbsp;&amp;nbsp;&amp;nbsp; NB&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; P&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; Q&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; Time&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; Gflops&lt;BR /&gt;
	--------------------------------------------------------------------------------&lt;BR /&gt;
	WC00C2R2&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 10000&amp;nbsp;&amp;nbsp; 384&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 2&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 1&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 3.87&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;STRONG&gt;1.72230e+02&lt;/STRONG&gt;&lt;/P&gt;

&lt;P&gt;&lt;STRONG&gt;Your Case 1: &lt;/STRONG&gt;&lt;/P&gt;

&lt;P&gt;root@dell-r640:/opt/intel/compilers_and_libraries_2018.1.163/linux/mkl/benchmarks/mp_linpack# mpirun -env HPL_HOST_NODE=0 -np 1 ./xhpl_intel64_dynamic : -env HPL_HOST_NODE=1 -np 1 ./xhpl_intel64_dynamic&lt;BR /&gt;
	[0] MPI startup(): Multi-threaded optimized library&lt;BR /&gt;
	[1] MPI startup(): shm data transfer mode&lt;BR /&gt;
	[0] MPI startup(): shm data transfer mode&lt;BR /&gt;
	[0] MPI startup(): Rank&amp;nbsp;&amp;nbsp;&amp;nbsp; Pid&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; Node name&amp;nbsp; Pin cpu&lt;BR /&gt;
	[0] MPI startup(): 0&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 217894&amp;nbsp;&amp;nbsp; dell-r640&amp;nbsp; {0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30}&lt;BR /&gt;
	[0] MPI startup(): 1&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 217895&amp;nbsp;&amp;nbsp; dell-r640&amp;nbsp; {1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31}&lt;BR /&gt;
	[0] MPI startup(): I_MPI_DEBUG=5&lt;BR /&gt;
	[0] MPI startup(): I_MPI_INFO_NUMA_NODE_NUM=2&lt;BR /&gt;
	[0] MPI startup(): I_MPI_PIN_MAPPING=2:0 0,1 1&lt;BR /&gt;
	================================================================================&lt;BR /&gt;
	T/V&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; N&amp;nbsp;&amp;nbsp;&amp;nbsp; NB&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; P&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; Q&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; Time&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; Gflops&lt;BR /&gt;
	--------------------------------------------------------------------------------&lt;BR /&gt;
	WC00C2R2&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 10000&amp;nbsp;&amp;nbsp; 384&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 2&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 1&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 2.65&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;STRONG&gt; 2.51188e+02&lt;/STRONG&gt;&lt;BR /&gt;
	HPL_pdgesv() start time Wed May&amp;nbsp; 9 17:20:37 2018&lt;BR /&gt;
	&amp;nbsp;&lt;/P&gt;

&lt;P&gt;Case 3:&lt;BR /&gt;
	s/mp_linpack# mpirun -np 2 ./xhpl_intel64_dynamic&lt;BR /&gt;
	[0] MPI startup(): Multi-threaded optimized library&lt;BR /&gt;
	[0] MPI startup(): shm data transfer mode&lt;BR /&gt;
	[1] MPI startup(): shm data transfer mode&lt;BR /&gt;
	[0] MPI startup(): Rank&amp;nbsp;&amp;nbsp;&amp;nbsp; Pid&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; Node name&amp;nbsp; Pin cpu&lt;BR /&gt;
	[0] MPI startup(): 0&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 220999&amp;nbsp;&amp;nbsp; dell-r640&amp;nbsp; {0}&lt;BR /&gt;
	[0] MPI startup(): 1&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 221000&amp;nbsp;&amp;nbsp; dell-r640&amp;nbsp; {16}&lt;BR /&gt;
	[0] MPI startup(): I_MPI_DEBUG=5&lt;BR /&gt;
	[0] MPI startup(): I_MPI_INFO_NUMA_NODE_NUM=2&lt;BR /&gt;
	[0] MPI startup(): I_MPI_PIN_MAPPING=2:0 0,1 16&lt;BR /&gt;
	================================================================================&lt;BR /&gt;
	T/V&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; N&amp;nbsp;&amp;nbsp;&amp;nbsp; NB&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; P&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; Q&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; Time&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; Gflops&lt;BR /&gt;
	--------------------------------------------------------------------------------&lt;BR /&gt;
	WC00C2R2&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 10000&amp;nbsp;&amp;nbsp; 384&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 2&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 1&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 3.76&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 1.77269e+02&lt;/P&gt;

&lt;P&gt;Best Regards,&lt;BR /&gt;
	Ying&lt;/P&gt;

</description>
      <pubDate>Wed, 09 May 2018 06:12:02 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/LINPACK-with-multiple-MPI-ranks/m-p/1176236#M28961</guid>
      <dc:creator>Ying_H_Intel</dc:creator>
      <dc:date>2018-05-09T06:12:02Z</dc:date>
    </item>
    <item>
      <title> </title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/LINPACK-with-multiple-MPI-ranks/m-p/1176237#M28962</link>
      <description>&lt;P&gt;To summarize, so that more developers can refer to this:&lt;/P&gt;

&lt;P&gt;Two basic points:&lt;/P&gt;

&lt;P&gt;MPI_PROC_NUM = the number of actual physical servers, which equals P x Q.&lt;/P&gt;

&lt;P&gt;MPI_PER_NODE = this should be 1, whether you have a single-socket or dual-socket system. If you set it to 2 on a dual-socket system, htop will show about 40% memory usage while in fact about 80% is being used, and there will be two controlling threads instead of one.&lt;/P&gt;
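
&lt;P&gt;As a sketch only (it assumes these two variables are read by the runme_intel64-style launcher shipped with mp_linpack; adjust to the script you actually use), this translates to:&lt;/P&gt;

&lt;P&gt;# in the launcher script: one rank per physical node&lt;BR /&gt;
	export MPI_PROC_NUM=2 # total MPI ranks, must equal P x Q in HPL.dat&lt;BR /&gt;
	export MPI_PER_NODE=1 # keep this at 1 regardless of socket count&lt;/P&gt;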

&lt;P&gt;By default, HPL will use the whole machine's resources, so two HPL processes created on one node end up sharing most of those resources; hence the bad performance in Case 2.&lt;/P&gt;

&lt;P&gt;Thus we recommend one of the following:&lt;/P&gt;

&lt;P style="margin: 0px 0px 0px 48px; text-indent: -18pt;"&gt;&lt;SPAN style="margin: 0px; color: rgb(31, 73, 125); font-family: &amp;quot;Calibri&amp;quot;,sans-serif; font-size: 11pt;"&gt;&lt;SPAN style="margin: 0px;"&gt;A.&lt;SPAN style="font: 7pt &amp;quot;Times New Roman&amp;quot;; margin: 0px; font-size-adjust: none; font-stretch: normal;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/SPAN&gt;&lt;/SPAN&gt;Save&amp;nbsp; #!/bin/bash&lt;BR /&gt;
	export HPL_HOST_NODE=$(($PMI_RANK * 2 + 0)),$(($PMI_RANK * 2 + 1))&lt;/SPAN&gt;&lt;/P&gt;

&lt;P style="margin: 0px;"&gt;&lt;SPAN style="margin: 0px; color: rgb(31, 73, 125); font-family: &amp;quot;Calibri&amp;quot;,sans-serif; font-size: 11pt;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;./xhpl_intel64_dynamic $* &lt;/SPAN&gt;&lt;/P&gt;

&lt;P style="margin: 0px;"&gt;&lt;SPAN style="margin: 0px; color: rgb(31, 73, 125); font-family: &amp;quot;Calibri&amp;quot;,sans-serif; font-size: 11pt;"&gt;as runme script and then&amp;nbsp;run &lt;/SPAN&gt;&lt;/P&gt;

&lt;P style="margin: 0px 0px 0px 48px;"&gt;&lt;SPAN style="margin: 0px; color: rgb(31, 73, 125); font-family: &amp;quot;Calibri&amp;quot;,sans-serif; font-size: 11pt;"&gt;mpirun –n&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN style="margin: 0px; color: red; font-family: &amp;quot;Calibri&amp;quot;,sans-serif; font-size: 11pt;"&gt;2&lt;/SPAN&gt;&lt;SPAN style="margin: 0px; color: rgb(31, 73, 125); font-family: &amp;quot;Calibri&amp;quot;,sans-serif; font-size: 11pt;"&gt; ./runme –p 2 –q 1 -b 384 –n 40000 &lt;/SPAN&gt;&lt;/P&gt;

&lt;P style="margin: 0px 0px 0px 48px;"&gt;&amp;nbsp;&lt;/P&gt;

&lt;P style="margin: 0px 0px 0px 48px;"&gt;&amp;nbsp;&lt;/P&gt;

&lt;P style="margin: 0px 0px 0px 48px;"&gt;&amp;nbsp;&lt;/P&gt;

&lt;P style="background: white; margin: 0px 40px 24px 0px;"&gt;&lt;SPAN lang="EN" style="margin: 0px; color: rgb(83, 87, 94); font-family: &amp;quot;Arial&amp;quot;,sans-serif; font-size: 9.5pt;"&gt;Or&amp;nbsp;&amp;nbsp;B: &amp;nbsp;&lt;/SPAN&gt;&lt;SPAN style="margin: 0px; color: rgb(31, 73, 125); font-family: &amp;quot;Calibri&amp;quot;,sans-serif; font-size: 11pt;"&gt;mpirun -env HPL_HOST_NODE=0,1 -np 1 ./xhpl_intel64_dynamic : -env HPL_HOST_NODE=2, 3 -np 1 ./xhpl_intel64_dynamic&amp;nbsp; (where, p=2, Q=1 in HPL.dat)&lt;/SPAN&gt;&lt;SPAN style="margin: 0px; color: rgb(83, 87, 94); font-family: &amp;quot;Arial&amp;quot;,sans-serif; font-size: 9.5pt;"&gt; &lt;/SPAN&gt;&lt;/P&gt;

&lt;P style="background: white; margin: 0px 40px 24px 0px;"&gt;&lt;SPAN lang="EN" style="margin: 0px; color: rgb(83, 87, 94); font-family: &amp;quot;Arial&amp;quot;,sans-serif; font-size: 9.5pt;"&gt;Or&amp;nbsp;&amp;nbsp; C.&amp;nbsp; &lt;/SPAN&gt;&lt;SPAN style="margin: 0px; color: rgb(31, 73, 125); font-family: &amp;quot;Calibri&amp;quot;,sans-serif; font-size: 11pt;"&gt;mpirun -np 1 ./xhpl_intel64_dynamic -p 1 -q 1&lt;/SPAN&gt;&lt;SPAN lang="EN" style="margin: 0px; color: rgb(83, 87, 94); font-family: &amp;quot;Arial&amp;quot;,sans-serif; font-size: 9.5pt;"&gt;&amp;nbsp; &lt;/SPAN&gt;&lt;SPAN style="margin: 0px; color: rgb(31, 73, 125); font-family: &amp;quot;Calibri&amp;quot;,sans-serif; font-size: 11pt;"&gt;-b 384 –n 40000&lt;/SPAN&gt;&lt;/P&gt;

&lt;P style="background: white; margin: 0px 40px 24px 0px;"&gt;&lt;SPAN style="margin: 0px; color: rgb(31, 73, 125); font-family: &amp;quot;Calibri&amp;quot;,sans-serif; font-size: 11pt;"&gt;​Case A and Case&amp;nbsp;B should be similar and as Holger's test.&amp;nbsp; Case c should be&amp;nbsp;almost&amp;nbsp;same&amp;nbsp;result as A and B.&amp;nbsp; and it is&amp;nbsp;also fine&amp;nbsp;if you have more nodes in systems and each&amp;nbsp;node have 1 mpi rank. &amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;

&lt;P style="background: white; margin: 0px 40px 24px 0px;"&gt;&lt;SPAN style="margin: 0px; color: rgb(31, 73, 125); font-family: &amp;quot;Calibri&amp;quot;,sans-serif; font-size: 11pt;"&gt;​Best Regards,&lt;/SPAN&gt;&lt;/P&gt;

&lt;P style="background: white; margin: 0px 40px 24px 0px;"&gt;&lt;SPAN style="margin: 0px; color: rgb(31, 73, 125); font-family: &amp;quot;Calibri&amp;quot;,sans-serif; font-size: 11pt;"&gt;​Ying&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN style="margin: 0px; color: rgb(31, 73, 125); font-family: &amp;quot;Calibri&amp;quot;,sans-serif; font-size: 11pt;"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 15 May 2018 03:42:12 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/LINPACK-with-multiple-MPI-ranks/m-p/1176237#M28962</guid>
      <dc:creator>Ying_H_Intel</dc:creator>
      <dc:date>2018-05-15T03:42:12Z</dc:date>
    </item>
    <item>
      <title>Hi, what is the formula to</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/LINPACK-with-multiple-MPI-ranks/m-p/1176238#M28963</link>
      <description>&lt;P&gt;Hi, what is the formula to calculate N? I searched the documents but didn't find the right formula to calculate 'N', i.e. the problem size, based on the available memory of the host.&lt;/P&gt;
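
&lt;P&gt;(For reference, a commonly used rule of thumb that is not stated elsewhere in this thread: the LINPACK matrix stores N x N double-precision values at 8 bytes each, so choose N near sqrt(0.80 x total_memory_in_bytes / 8), i.e. use roughly 80% of host memory, then round N down to a multiple of NB. For an assumed node with 192 GiB of RAM this gives N of about sqrt(0.80 x 192 x 1024^3 / 8), roughly 143,500, which rounds down to 143,232 for NB=384.)&lt;/P&gt;</description>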
      <pubDate>Tue, 02 Jul 2019 23:48:21 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/LINPACK-with-multiple-MPI-ranks/m-p/1176238#M28963</guid>
      <dc:creator>Anup_N_Intel</dc:creator>
      <dc:date>2019-07-02T23:48:21Z</dc:date>
    </item>
  </channel>
</rss>

