<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Eric, in Intel® Moderncode for Parallel Architectures</title>
    <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/OpenMP-MPI-performance-degradation-on-windows/m-p/1110090#M7440</link>
    <description>&lt;P&gt;Eric,&lt;/P&gt;

&lt;P&gt;I am a noob on placement, here is my suggestion&lt;/P&gt;

&lt;PRE class="brush:bash;"&gt;# remove any environment "global" variables relating to affinity
unset OMP_DISPLAY_ENV
unset OMP_NUM_THREADS
unset KMP_AFFINITY
unset OMP_PLACES
unset OMP_PROC_BIND

echo Run with default bindings, no affinity
mpiexec -np 4 -genv OMP_NUM_THREADS 4 ht.exe input.txt 4

echo 1 hw thread per place
mpiexec -env OMP_PLACES "{0:1}:4:8" ht.exe input.txt 4 : -env OMP_PLACES "{2:1}:4:8" ht.exe input.txt 4 : -env OMP_PLACES "{4:1}:4:8" ht.exe input.txt 4 : -env OMP_PLACES "{6:1}:4:8" ht.exe input.txt 4

echo 2 hw threads per place
mpiexec -env OMP_PLACES "{0:2}:4:8" ht.exe input.txt 4 : -env OMP_PLACES "{2:2}:4:8" ht.exe input.txt 4 : -env OMP_PLACES "{4:2}:4:8" ht.exe input.txt 4 : -env OMP_PLACES "{6:2}:4:8" ht.exe input.txt 4
&lt;/PRE&gt;

&lt;P&gt;IOW remove any and all globally set environment variables that are related to affinity placement and number of threads.&lt;/P&gt;

&lt;P&gt;Then, for each test run, specify the places that provide a desirable distribution.&lt;/P&gt;

&lt;P&gt;*** Note, the above is specific to your system configuration. There is likely a generic mpiexec setting that will permit you to use "-np 4" .AND. distribute the affinities equally.&lt;/P&gt;

&lt;P&gt;Note 2: the first mpiexec above, with defaults, should affinity-pin &lt;STRONG&gt;each process &lt;/STRONG&gt;to a different subset of the hardware threads (total number of hardware threads / 4 per process). Then each process will run OpenMP with 4 software threads&amp;nbsp;within the&amp;nbsp;process's restricted affinity. You can experiment with adding KMP_AFFINITY variations to the first mpiexec line above; however, note that OMP_PLACES and KMP_AFFINITY are mutually exclusive.&lt;/P&gt;

&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
    <pubDate>Fri, 16 Dec 2016 15:58:19 GMT</pubDate>
    <dc:creator>jimdempseyatthecove</dc:creator>
    <dc:date>2016-12-16T15:58:19Z</dc:date>
    <item>
      <title>OpenMP-MPI performance degradation on windows</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/OpenMP-MPI-performance-degradation-on-windows/m-p/1110087#M7437</link>
      <description>&lt;P&gt;Hi all, and thanks for your help.&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 1em;"&gt;&amp;nbsp;I have developed high-throughput OpenMP software in C called ht.xx. This software must be multiplatform (Linux and Windows), but the performance on the two OSes is very different. On both I use Intel Parallel Studio Cluster Edition 2017. The Linux workstation has 36 cores (dual-socket E5-2697 v4 @ 2.30GHz, CentOS 7.2), while the Windows workstation has 32 cores (dual-socket E5-2697A v4 @ 2.60 GHz, Windows Server 2012 R2).&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;I launch my application on both systems as:&lt;/P&gt;

&lt;P&gt;&lt;STRONG&gt;mpirun&lt;/STRONG&gt; &lt;STRONG&gt;-np&lt;/STRONG&gt; &lt;EM&gt;num_mpitask&lt;/EM&gt; &lt;STRONG&gt;./ht.xx&lt;/STRONG&gt; input_file &lt;EM&gt;num_ompthreads&lt;/EM&gt;&lt;/P&gt;

&lt;P&gt;So the application takes two parameters: the first is its input file (a list of jobs) and the second is the number of OpenMP threads for each MPI task.&lt;/P&gt;

&lt;P&gt;The performance on Windows and Linux is equal if num_mpitask is 1 &lt;SPAN style="font-size: 13.008px;"&gt;(time to solution = 300 ms)&lt;/SPAN&gt;, but for a greater number of MPI tasks I obtain these results:&lt;/P&gt;

&lt;UL&gt;
	&lt;LI&gt;On Linux, increasing the number of MPI tasks causes a degradation of performance (e.g. for 4 MPI tasks (num_ompthreads=4) each task is completed in 360 ms, so &lt;STRONG&gt;60 ms slower&lt;/STRONG&gt;)&lt;/LI&gt;
	&lt;LI&gt;On Windows, increasing the number of MPI tasks causes&amp;nbsp;a big degradation of performance (&lt;SPAN style="font-size: 13.008px;"&gt;e.g. for 4 MPI tasks (num_ompthreads=4)&amp;nbsp;each task is completed in 640 ms, so &lt;STRONG&gt;340 ms slower&lt;/STRONG&gt;&lt;/SPAN&gt;)&lt;/LI&gt;
&lt;/UL&gt;

&lt;P&gt;&amp;nbsp;I have tried on the Windows workstation to change the MPI binding of tasks and the OpenMP affinity of threads to the cores, but these trials had no effect.&lt;/P&gt;

&lt;P&gt;Can someone help me explain this behaviour? I don't understand why on Linux with multiple MPI tasks I have a degradation of 60 ms, while on Windows the time to solution doubles.&lt;/P&gt;

&lt;P&gt;Thanks for attention&lt;/P&gt;

&lt;P&gt;Best regards&lt;/P&gt;

&lt;P&gt;Eric&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Fri, 09 Dec 2016 21:43:04 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/OpenMP-MPI-performance-degradation-on-windows/m-p/1110087#M7437</guid>
      <dc:creator>eric_p_</dc:creator>
      <dc:date>2016-12-09T21:43:04Z</dc:date>
    </item>
    <item>
      <title>Are you linking with MKL? If</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/OpenMP-MPI-performance-degradation-on-windows/m-p/1110088#M7438</link>
      <description>&lt;P&gt;Are you linking with MKL? If so, which configuration of MKL (single thread, or multi-thread)?&lt;/P&gt;

&lt;P&gt;Note, typically you should use the single thread MKL when using multi-thread OpenMP. And conversely&lt;BR /&gt;
	use the multi-thread MKL when not using OpenMP within a process.&lt;/P&gt;

&lt;P&gt;Not following the above rule typically results in oversubscription, however, as you have configured using -np 4&lt;BR /&gt;
	on your Linux system (36 cores), each process should have 36/4 (9) cores per process, clearly in excess of 4.&lt;/P&gt;

&lt;P&gt;What may be happening is the thread pinning of each of the 4 processes is the same.&lt;/P&gt;

&lt;P&gt;OR...&lt;/P&gt;

&lt;P&gt;You may have Hyper Threading enabled on the Windows system, and the affinity pinning is not distributing amongst cores, rather to threads within core.&lt;/P&gt;

&lt;P&gt;Try adding "verbose" to your KMP_AFFINITY environment variable ***&lt;BR /&gt;
	Note, each process on the same system should be assigned a KMP_AFFINITY via an argument passed on your mpirun (to avoid running on the same logical processors)&lt;/P&gt;
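
&lt;P&gt;As a rough sketch of what a separate KMP_AFFINITY per process could look like: the helper name build_affinity_cmd and the choice of consecutive proclist ranges are assumptions for illustration only, and the command line is printed rather than run.&lt;/P&gt;

```shell
# Hypothetical helper: print (do not run) an mpiexec line that gives each
# rank its own explicit KMP_AFFINITY proclist of consecutive logical
# processors. Assumes ranks * threads does not exceed the number of
# logical processors on the node.
build_affinity_cmd() {
  ranks=$1
  threads=$2
  cmd="mpiexec"
  r=0
  while [ "$r" -lt "$ranks" ]; do
    first=$((r * threads))
    last=$((first + threads - 1))
    # Argument sets for different ranks are separated by ":".
    if [ "$r" -gt 0 ]; then
      cmd="$cmd :"
    fi
    cmd="$cmd -n 1 -env OMP_NUM_THREADS $threads -env KMP_AFFINITY verbose,granularity=fine,proclist=[$first-$last],explicit ht.exe input.txt $threads"
    r=$((r + 1))
  done
  echo "$cmd"
}

# Example: 4 ranks with 4 OpenMP threads each.
build_affinity_cmd 4 4
```

&lt;P&gt;The "verbose" output of each process should then report a different OS proc set per pid; if two pids report the same set, the processes are stacked on the same logical processors.&lt;/P&gt;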

&lt;P&gt;Jim Dempsey&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Fri, 09 Dec 2016 22:41:21 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/OpenMP-MPI-performance-degradation-on-windows/m-p/1110088#M7438</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2016-12-09T22:41:21Z</dc:date>
    </item>
    <item>
      <title>Hi Jim, thanks for your reply.</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/OpenMP-MPI-performance-degradation-on-windows/m-p/1110089#M7439</link>
      <description>&lt;P&gt;Hi Jim, thanks for your reply.&lt;/P&gt;

&lt;P&gt;No, I don't use MKL, and Hyper-Threading is disabled on the Windows workstation.&lt;/P&gt;

&lt;P&gt;I tried changing the affinity and the MPI task binding, but my trials had no effect. The example below (with verbose mode enabled) was run with 4 MPI tasks and 4 OpenMP threads per task:&lt;/P&gt;

&lt;PRE class="brush:plain;"&gt;mpiexec -binding map=0,5,10,15 -genv OMP_DISPLAY_ENV true -genv OMP_NUM_THREADS 4  -genv KMP_AFFINITY=verbose,compact -genv OMP_PROC_BIND close -genv OMP_PLACES cores -np 4 ht.exe input.txt 4

OMP: Warning #181: OMP_PLACES: ignored because KMP_AFFINITY has been defined
OMP: Warning #181: OMP_PROC_BIND: ignored because KMP_AFFINITY has been defined

OPENMP DISPLAY ENVIRONMENT BEGIN
   _OPENMP='201511'
  [host] OMP_CANCELLATION='FALSE'
  [host] OMP_DEFAULT_DEVICE='0'
  [host] OMP_DISPLAY_ENV='TRUE'
  [host] OMP_DYNAMIC='FALSE'
  [host] OMP_MAX_ACTIVE_LEVELS='2147483647'
  [host] OMP_MAX_TASK_PRIORITY='0'
  [host] OMP_NESTED='FALSE'
  [host] OMP_NUM_THREADS='4'
  [host] OMP_PLACES='cores'
  [host] OMP_PROC_BIND='intel'
  [host] OMP_SCHEDULE='static'
  [host] OMP_STACKSIZE='4M'
  [host] OMP_THREAD_LIMIT='2147483647'
  [host] OMP_WAIT_POLICY='PASSIVE'
OPENMP DISPLAY ENVIRONMENT END


OMP: Info #204: KMP_AFFINITY: decoding x2APIC ids.
OMP: Warning #181: OMP_PLACES: ignored because KMP_AFFINITY has been defined
OMP: Warning #181: OMP_PROC_BIND: ignored because KMP_AFFINITY has been defined

OPENMP DISPLAY ENVIRONMENT BEGIN
   _OPENMP='201511'
  [host] OMP_CANCELLATION='FALSE'
  [host] OMP_DEFAULT_DEVICE='0'
  [host] OMP_DISPLAY_ENV='TRUE'
  [host] OMP_DYNAMIC='FALSE'
  [host] OMP_MAX_ACTIVE_LEVELS='2147483647'
  [host] OMP_MAX_TASK_PRIORITY='0'
  [host] OMP_NESTED='FALSE'
  [host] OMP_NUM_THREADS='4'
  [host] OMP_PLACES='cores'
  [host] OMP_PROC_BIND='intel'
  [host] OMP_SCHEDULE='static'
  [host] OMP_STACKSIZE='4M'
  [host] OMP_THREAD_LIMIT='2147483647'
  [host] OMP_WAIT_POLICY='PASSIVE'
OPENMP DISPLAY ENVIRONMENT END


OMP: Info #204: KMP_AFFINITY: decoding x2APIC ids.
OMP: Warning #181: OMP_PLACES: ignored because KMP_AFFINITY has been defined
OMP: Warning #181: OMP_PROC_BIND: ignored because KMP_AFFINITY has been defined

OPENMP DISPLAY ENVIRONMENT BEGIN
   _OPENMP='201511'
  [host] OMP_CANCELLATION='FALSE'
  [host] OMP_DEFAULT_DEVICE='0'
  [host] OMP_DISPLAY_ENV='TRUE'
  [host] OMP_DYNAMIC='FALSE'
  [host] OMP_MAX_ACTIVE_LEVELS='2147483647'
  [host] OMP_MAX_TASK_PRIORITY='0'
  [host] OMP_NESTED='FALSE'
  [host] OMP_NUM_THREADS='4'
  [host] OMP_PLACES='cores'
  [host] OMP_PROC_BIND='intel'
  [host] OMP_SCHEDULE='static'
  [host] OMP_STACKSIZE='4M'
  [host] OMP_THREAD_LIMIT='2147483647'
  [host] OMP_WAIT_POLICY='PASSIVE'
OPENMP DISPLAY ENVIRONMENT END


OMP: Info #204: KMP_AFFINITY: decoding x2APIC ids.
OMP: Warning #181: OMP_PLACES: ignored because KMP_AFFINITY has been defined
OMP: Warning #181: OMP_PROC_BIND: ignored because KMP_AFFINITY has been defined

OPENMP DISPLAY ENVIRONMENT BEGIN
   _OPENMP='201511'
  [host] OMP_CANCELLATION='FALSE'
  [host] OMP_DEFAULT_DEVICE='0'
  [host] OMP_DISPLAY_ENV='TRUE'
  [host] OMP_DYNAMIC='FALSE'
  [host] OMP_MAX_ACTIVE_LEVELS='2147483647'
  [host] OMP_MAX_TASK_PRIORITY='0'
  [host] OMP_NESTED='FALSE'
  [host] OMP_NUM_THREADS='4'
  [host] OMP_PLACES='cores'
  [host] OMP_PROC_BIND='intel'
  [host] OMP_SCHEDULE='static'
  [host] OMP_STACKSIZE='4M'
  [host] OMP_THREAD_LIMIT='2147483647'
  [host] OMP_WAIT_POLICY='PASSIVE'
OPENMP DISPLAY ENVIRONMENT END


OMP: Info #204: KMP_AFFINITY: decoding x2APIC ids.
OMP: Info #202: KMP_AFFINITY: Affinity capable, using global cpuid leaf 11 info
OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31}
OMP: Info #156: KMP_AFFINITY: 32 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #179: KMP_AFFINITY: 2 packages x 16 cores/pkg x 1 threads/core (32 total cores)
OMP: Info #206: KMP_AFFINITY: OS proc to physical thread map:
OMP: Info #171: KMP_AFFINITY: OS proc 0 maps to package 0 core 0 
OMP: Info #171: KMP_AFFINITY: OS proc 1 maps to package 0 core 1 
OMP: Info #171: KMP_AFFINITY: OS proc 2 maps to package 0 core 2 
OMP: Info #171: KMP_AFFINITY: OS proc 3 maps to package 0 core 3 
OMP: Info #171: KMP_AFFINITY: OS proc 4 maps to package 0 core 4 
OMP: Info #171: KMP_AFFINITY: OS proc 5 maps to package 0 core 5 
OMP: Info #171: KMP_AFFINITY: OS proc 6 maps to package 0 core 6 
OMP: Info #171: KMP_AFFINITY: OS proc 7 maps to package 0 core 7 
OMP: Info #171: KMP_AFFINITY: OS proc 8 maps to package 0 core 8 
OMP: Info #171: KMP_AFFINITY: OS proc 9 maps to package 0 core 9 
OMP: Info #171: KMP_AFFINITY: OS proc 10 maps to package 0 core 10 
OMP: Info #171: KMP_AFFINITY: OS proc 11 maps to package 0 core 11 
OMP: Info #171: KMP_AFFINITY: OS proc 12 maps to package 0 core 12 
OMP: Info #171: KMP_AFFINITY: OS proc 13 maps to package 0 core 13 
OMP: Info #171: KMP_AFFINITY: OS proc 14 maps to package 0 core 14 
OMP: Info #171: KMP_AFFINITY: OS proc 15 maps to package 0 core 15 
OMP: Info #171: KMP_AFFINITY: OS proc 16 maps to package 1 core 0 
OMP: Info #171: KMP_AFFINITY: OS proc 17 maps to package 1 core 1 
OMP: Info #171: KMP_AFFINITY: OS proc 18 maps to package 1 core 2 
OMP: Info #171: KMP_AFFINITY: OS proc 19 maps to package 1 core 3 
OMP: Info #171: KMP_AFFINITY: OS proc 20 maps to package 1 core 4 
OMP: Info #171: KMP_AFFINITY: OS proc 21 maps to package 1 core 5 
OMP: Info #171: KMP_AFFINITY: OS proc 22 maps to package 1 core 6 
OMP: Info #171: KMP_AFFINITY: OS proc 23 maps to package 1 core 7 
OMP: Info #171: KMP_AFFINITY: OS proc 24 maps to package 1 core 8 
OMP: Info #171: KMP_AFFINITY: OS proc 25 maps to package 1 core 9 
OMP: Info #171: KMP_AFFINITY: OS proc 26 maps to package 1 core 10 
OMP: Info #171: KMP_AFFINITY: OS proc 27 maps to package 1 core 11 
OMP: Info #171: KMP_AFFINITY: OS proc 28 maps to package 1 core 12 
OMP: Info #171: KMP_AFFINITY: OS proc 29 maps to package 1 core 13 
OMP: Info #171: KMP_AFFINITY: OS proc 30 maps to package 1 core 14 
OMP: Info #171: KMP_AFFINITY: OS proc 31 maps to package 1 core 15 
OMP: Info #242: KMP_AFFINITY: pid 11656 thread 0 bound to OS proc set {0}
OMP: Info #202: KMP_AFFINITY: Affinity capable, using global cpuid leaf 11 info
OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31}
OMP: Info #156: KMP_AFFINITY: 32 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #179: KMP_AFFINITY: 2 packages x 16 cores/pkg x 1 threads/core (32 total cores)
OMP: Info #206: KMP_AFFINITY: OS proc to physical thread map:
OMP: Info #171: KMP_AFFINITY: OS proc 0 maps to package 0 core 0 
OMP: Info #171: KMP_AFFINITY: OS proc 1 maps to package 0 core 1 
OMP: Info #171: KMP_AFFINITY: OS proc 2 maps to package 0 core 2 
OMP: Info #171: KMP_AFFINITY: OS proc 3 maps to package 0 core 3 
OMP: Info #171: KMP_AFFINITY: OS proc 4 maps to package 0 core 4 
OMP: Info #171: KMP_AFFINITY: OS proc 5 maps to package 0 core 5 
OMP: Info #171: KMP_AFFINITY: OS proc 6 maps to package 0 core 6 
OMP: Info #171: KMP_AFFINITY: OS proc 7 maps to package 0 core 7 
OMP: Info #171: KMP_AFFINITY: OS proc 8 maps to package 0 core 8 
OMP: Info #171: KMP_AFFINITY: OS proc 9 maps to package 0 core 9 
OMP: Info #242: KMP_AFFINITY: pid 11656 thread 1 bound to OS proc set {1}
OMP: Info #171: KMP_AFFINITY: OS proc 10 maps to package 0 core 10 
OMP: Info #171: KMP_AFFINITY: OS proc 11 maps to package 0 core 11 
OMP: Info #171: KMP_AFFINITY: OS proc 12 maps to package 0 core 12 
OMP: Info #171: KMP_AFFINITY: OS proc 13 maps to package 0 core 13 
OMP: Info #171: KMP_AFFINITY: OS proc 14 maps to package 0 core 14 
OMP: Info #171: KMP_AFFINITY: OS proc 15 maps to package 0 core 15 
OMP: Info #171: KMP_AFFINITY: OS proc 16 maps to package 1 core 0 
OMP: Info #242: KMP_AFFINITY: pid 11656 thread 2 bound to OS proc set {2}
OMP: Info #171: KMP_AFFINITY: OS proc 17 maps to package 1 core 1 
OMP: Info #171: KMP_AFFINITY: OS proc 18 maps to package 1 core 2 
OMP: Info #171: KMP_AFFINITY: OS proc 19 maps to package 1 core 3 
OMP: Info #171: KMP_AFFINITY: OS proc 20 maps to package 1 core 4 
OMP: Info #171: KMP_AFFINITY: OS proc 21 maps to package 1 core 5 
OMP: Info #171: KMP_AFFINITY: OS proc 22 maps to package 1 core 6 
OMP: Info #242: KMP_AFFINITY: pid 11656 thread 3 bound to OS proc set {3}
OMP: Info #171: KMP_AFFINITY: OS proc 23 maps to package 1 core 7 
OMP: Info #171: KMP_AFFINITY: OS proc 24 maps to package 1 core 8 
OMP: Info #171: KMP_AFFINITY: OS proc 25 maps to package 1 core 9 
OMP: Info #171: KMP_AFFINITY: OS proc 26 maps to package 1 core 10 
OMP: Info #171: KMP_AFFINITY: OS proc 27 maps to package 1 core 11 
OMP: Info #171: KMP_AFFINITY: OS proc 28 maps to package 1 core 12 
OMP: Info #171: KMP_AFFINITY: OS proc 29 maps to package 1 core 13 
OMP: Info #171: KMP_AFFINITY: OS proc 30 maps to package 1 core 14 
OMP: Info #171: KMP_AFFINITY: OS proc 31 maps to package 1 core 15 
OMP: Info #242: KMP_AFFINITY: pid 8976 thread 0 bound to OS proc set {0}
OMP: Info #242: KMP_AFFINITY: pid 8976 thread 1 bound to OS proc set {1}
OMP: Info #242: KMP_AFFINITY: pid 8976 thread 2 bound to OS proc set {2}
OMP: Info #242: KMP_AFFINITY: pid 8976 thread 3 bound to OS proc set {3}
OMP: Info #202: KMP_AFFINITY: Affinity capable, using global cpuid leaf 11 info
OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31}
OMP: Info #156: KMP_AFFINITY: 32 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #179: KMP_AFFINITY: 2 packages x 16 cores/pkg x 1 threads/core (32 total cores)
OMP: Info #206: KMP_AFFINITY: OS proc to physical thread map:
OMP: Info #171: KMP_AFFINITY: OS proc 0 maps to package 0 core 0 
OMP: Info #171: KMP_AFFINITY: OS proc 1 maps to package 0 core 1 
OMP: Info #171: KMP_AFFINITY: OS proc 2 maps to package 0 core 2 
OMP: Info #171: KMP_AFFINITY: OS proc 3 maps to package 0 core 3 
OMP: Info #171: KMP_AFFINITY: OS proc 4 maps to package 0 core 4 
OMP: Info #171: KMP_AFFINITY: OS proc 5 maps to package 0 core 5 
OMP: Info #171: KMP_AFFINITY: OS proc 6 maps to package 0 core 6 
OMP: Info #171: KMP_AFFINITY: OS proc 7 maps to package 0 core 7 
OMP: Info #171: KMP_AFFINITY: OS proc 8 maps to package 0 core 8 
OMP: Info #171: KMP_AFFINITY: OS proc 9 maps to package 0 core 9 
OMP: Info #171: KMP_AFFINITY: OS proc 10 maps to package 0 core 10 
OMP: Info #171: KMP_AFFINITY: OS proc 11 maps to package 0 core 11 
OMP: Info #171: KMP_AFFINITY: OS proc 12 maps to package 0 core 12 
OMP: Info #171: KMP_AFFINITY: OS proc 13 maps to package 0 core 13 
OMP: Info #171: KMP_AFFINITY: OS proc 14 maps to package 0 core 14 
OMP: Info #171: KMP_AFFINITY: OS proc 15 maps to package 0 core 15 
OMP: Info #171: KMP_AFFINITY: OS proc 16 maps to package 1 core 0 
OMP: Info #171: KMP_AFFINITY: OS proc 17 maps to package 1 core 1 
OMP: Info #171: KMP_AFFINITY: OS proc 18 maps to package 1 core 2 
OMP: Info #171: KMP_AFFINITY: OS proc 19 maps to package 1 core 3 
OMP: Info #171: KMP_AFFINITY: OS proc 20 maps to package 1 core 4 
OMP: Info #171: KMP_AFFINITY: OS proc 21 maps to package 1 core 5 
OMP: Info #171: KMP_AFFINITY: OS proc 22 maps to package 1 core 6 
OMP: Info #171: KMP_AFFINITY: OS proc 23 maps to package 1 core 7 
OMP: Info #171: KMP_AFFINITY: OS proc 24 maps to package 1 core 8 
OMP: Info #171: KMP_AFFINITY: OS proc 25 maps to package 1 core 9 
OMP: Info #171: KMP_AFFINITY: OS proc 26 maps to package 1 core 10 
OMP: Info #171: KMP_AFFINITY: OS proc 27 maps to package 1 core 11 
OMP: Info #171: KMP_AFFINITY: OS proc 28 maps to package 1 core 12 
OMP: Info #171: KMP_AFFINITY: OS proc 29 maps to package 1 core 13 
OMP: Info #171: KMP_AFFINITY: OS proc 30 maps to package 1 core 14 
OMP: Info #171: KMP_AFFINITY: OS proc 31 maps to package 1 core 15 
OMP: Info #242: KMP_AFFINITY: pid 12472 thread 0 bound to OS proc set {0}
OMP: Info #202: KMP_AFFINITY: Affinity capable, using global cpuid leaf 11 info
OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31}
OMP: Info #156: KMP_AFFINITY: 32 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #179: KMP_AFFINITY: 2 packages x 16 cores/pkg x 1 threads/core (32 total cores)
OMP: Info #206: KMP_AFFINITY: OS proc to physical thread map:
OMP: Info #171: KMP_AFFINITY: OS proc 0 maps to package 0 core 0 
OMP: Info #171: KMP_AFFINITY: OS proc 1 maps to package 0 core 1 
OMP: Info #171: KMP_AFFINITY: OS proc 2 maps to package 0 core 2 
OMP: Info #171: KMP_AFFINITY: OS proc 3 maps to package 0 core 3 
OMP: Info #171: KMP_AFFINITY: OS proc 4 maps to package 0 core 4 
OMP: Info #171: KMP_AFFINITY: OS proc 5 maps to package 0 core 5 
OMP: Info #171: KMP_AFFINITY: OS proc 6 maps to package 0 core 6 
OMP: Info #171: KMP_AFFINITY: OS proc 7 maps to package 0 core 7 
OMP: Info #171: KMP_AFFINITY: OS proc 8 maps to package 0 core 8 
OMP: Info #171: KMP_AFFINITY: OS proc 9 maps to package 0 core 9 
OMP: Info #171: KMP_AFFINITY: OS proc 10 maps to package 0 core 10 
OMP: Info #171: KMP_AFFINITY: OS proc 11 maps to package 0 core 11 
OMP: Info #171: KMP_AFFINITY: OS proc 12 maps to package 0 core 12 
OMP: Info #171: KMP_AFFINITY: OS proc 13 maps to package 0 core 13 
OMP: Info #171: KMP_AFFINITY: OS proc 14 maps to package 0 core 14 
OMP: Info #171: KMP_AFFINITY: OS proc 15 maps to package 0 core 15 
OMP: Info #171: KMP_AFFINITY: OS proc 16 maps to package 1 core 0 
OMP: Info #171: KMP_AFFINITY: OS proc 17 maps to package 1 core 1 
OMP: Info #171: KMP_AFFINITY: OS proc 18 maps to package 1 core 2 
OMP: Info #171: KMP_AFFINITY: OS proc 19 maps to package 1 core 3 
OMP: Info #171: KMP_AFFINITY: OS proc 20 maps to package 1 core 4 
OMP: Info #171: KMP_AFFINITY: OS proc 21 maps to package 1 core 5 
OMP: Info #171: KMP_AFFINITY: OS proc 22 maps to package 1 core 6 
OMP: Info #171: KMP_AFFINITY: OS proc 23 maps to package 1 core 7 
OMP: Info #171: KMP_AFFINITY: OS proc 24 maps to package 1 core 8 
OMP: Info #171: KMP_AFFINITY: OS proc 25 maps to package 1 core 9 
OMP: Info #171: KMP_AFFINITY: OS proc 26 maps to package 1 core 10 
OMP: Info #171: KMP_AFFINITY: OS proc 27 maps to package 1 core 11 
OMP: Info #171: KMP_AFFINITY: OS proc 28 maps to package 1 core 12 
OMP: Info #171: KMP_AFFINITY: OS proc 29 maps to package 1 core 13 
OMP: Info #171: KMP_AFFINITY: OS proc 30 maps to package 1 core 14 
OMP: Info #171: KMP_AFFINITY: OS proc 31 maps to package 1 core 15 
OMP: Info #242: KMP_AFFINITY: pid 11428 thread 0 bound to OS proc set {0}
OMP: Info #242: KMP_AFFINITY: pid 12472 thread 1 bound to OS proc set {1}
OMP: Info #242: KMP_AFFINITY: pid 12472 thread 2 bound to OS proc set {2}
OMP: Info #242: KMP_AFFINITY: pid 12472 thread 3 bound to OS proc set {3}
OMP: Info #242: KMP_AFFINITY: pid 11428 thread 1 bound to OS proc set {1}
OMP: Info #242: KMP_AFFINITY: pid 11428 thread 2 bound to OS proc set {2}
OMP: Info #242: KMP_AFFINITY: pid 11428 thread 3 bound to OS proc set {3}

HT vers 0.1
OMP THREADS = 4 
TIME TO SOLUTION (sec)  = 3.834136e-01 
END HT

HT vers 0.1
OMP THREADS = 4 
TIME TO SOLUTION (sec)  = 7.583328e-01 
END HT.xx

HT vers 0.1
OMP THREADS = 4 
TIME TO SOLUTION (sec)  = 7.535792e-01 
END HT.xx

HT vers 0.1
OMP THREADS = 4 
TIME TO SOLUTION (sec)  = 9.195819e-01 
END HT.xx&lt;/PRE&gt;

&lt;P&gt;And if I run with only 1 MPI task I obtain:&lt;/P&gt;

&lt;PRE class="brush:;"&gt;HT vers 0.1
OMP THREADS = 4 
TIME TO SOLUTION (sec)  = 3.982187e-01 
END HT&lt;/PRE&gt;

&lt;P&gt;What is wrong in my configuration?&amp;nbsp;&lt;/P&gt;

&lt;P&gt;Thanks&lt;/P&gt;

&lt;P&gt;Eric&lt;/P&gt;</description>
      <pubDate>Tue, 13 Dec 2016 19:56:37 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/OpenMP-MPI-performance-degradation-on-windows/m-p/1110089#M7439</guid>
      <dc:creator>eric_p_</dc:creator>
      <dc:date>2016-12-13T19:56:37Z</dc:date>
    </item>
    <item>
      <title>Eric,</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/OpenMP-MPI-performance-degradation-on-windows/m-p/1110090#M7440</link>
      <description>&lt;P&gt;Eric,&lt;/P&gt;

&lt;P&gt;I am a noob on placement, here is my suggestion&lt;/P&gt;

&lt;PRE class="brush:bash;"&gt;# remove any environment "global" variables relating to affinity
unset OMP_DISPLAY_ENV
unset OMP_NUM_THREADS
unset KMP_AFFINITY
unset OMP_PLACES
unset OMP_PROC_BIND

echo Run with default bindings, no affinity
mpiexec -np 4 -genv OMP_NUM_THREADS 4 ht.exe input.txt 4

echo 1 hw thread per place
mpiexec -env OMP_PLACES "{0:1}:4:8" ht.exe input.txt 4 : -env OMP_PLACES "{2:1}:4:8" ht.exe input.txt 4 : -env OMP_PLACES "{4:1}:4:8" ht.exe input.txt 4 : -env OMP_PLACES "{6:1}:4:8" ht.exe input.txt 4

echo 2 hw threads per place
mpiexec -env OMP_PLACES "{0:2}:4:8" ht.exe input.txt 4 : -env OMP_PLACES "{2:2}:4:8" ht.exe input.txt 4 : -env OMP_PLACES "{4:2}:4:8" ht.exe input.txt 4 : -env OMP_PLACES "{6:2}:4:8" ht.exe input.txt 4
&lt;/PRE&gt;

&lt;P&gt;IOW remove any and all globally set environment variables that are related to affinity placement and number of threads.&lt;/P&gt;

&lt;P&gt;Then, for each test run, specify the places that provide a desirable distribution.&lt;/P&gt;

&lt;P&gt;*** Note, the above is specific to your system configuration. There is likely a generic mpiexec setting that will permit you to use "-np 4" .AND. distribute the affinities equally.&lt;/P&gt;

&lt;P&gt;Note 2: the first mpiexec above, with defaults, should affinity-pin &lt;STRONG&gt;each process &lt;/STRONG&gt;to a different subset of the hardware threads (total number of hardware threads / 4 per process). Then each process will run OpenMP with 4 software threads&amp;nbsp;within the&amp;nbsp;process's restricted affinity. You can experiment with adding KMP_AFFINITY variations to the first mpiexec line above; however, note that OMP_PLACES and KMP_AFFINITY are mutually exclusive.&lt;/P&gt;
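
&lt;P&gt;To see which logical processors an interval such as "{0:1}:4:8" names, here is a small sketch that expands it by hand. expand_places is a hypothetical helper that handles only this one "{start:len}:count:stride" form; the real parsing is done by the OpenMP runtime, which accepts more forms than this.&lt;/P&gt;

```shell
# Expand an OMP_PLACES interval "{start:len}:count:stride" into the
# explicit place list it denotes (illustration only).
expand_places() {
  spec=$1
  inner=${spec#\{}        # drop the leading "{"
  inner=${inner%%\}*}     # keep "start:len"
  rest=${spec#*\}:}       # keep "count:stride"
  start=${inner%%:*}
  len=${inner#*:}
  count=${rest%%:*}
  stride=${rest#*:}
  out=""
  p=0
  while [ "$p" -lt "$count" ]; do
    # First logical processor of place number p.
    base=$((start + p * stride))
    place=""
    i=0
    while [ "$i" -lt "$len" ]; do
      if [ -n "$place" ]; then place="$place,"; fi
      place="$place$((base + i))"
      i=$((i + 1))
    done
    if [ -n "$out" ]; then out="$out,"; fi
    out="$out{$place}"
    p=$((p + 1))
  done
  echo "$out"
}

# The four starting offsets used in the "1 hw thread per place" run above.
for s in 0 2 4 6; do
  echo "start $s: $(expand_places "{$s:1}:4:8")"
done
```

&lt;P&gt;For the four starting offsets 0, 2, 4 and 6 this gives {0},{8},{16},{24}, then {2},{10},{18},{26}, and so on, so no two processes are given the same logical processor.&lt;/P&gt;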

&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
      <pubDate>Fri, 16 Dec 2016 15:58:19 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/OpenMP-MPI-performance-degradation-on-windows/m-p/1110090#M7440</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2016-12-16T15:58:19Z</dc:date>
    </item>
    <item>
      <title>Hi Jim, thanks for your reply.</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/OpenMP-MPI-performance-degradation-on-windows/m-p/1110091#M7441</link>
      <description>&lt;P&gt;&lt;SPAN style="font-size: 12px;"&gt;Hi Jim, thanks for your reply.&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;I don't know why, but the problem is related to the mpirun command. I introduced &lt;EM&gt;MPI_Init&lt;/EM&gt; and &lt;EM&gt;MPI_Finalize&lt;/EM&gt; into my software and included &lt;STRONG&gt;mpi.h&lt;/STRONG&gt; in order to obtain information with I_MPI_DEBUG and analyze the thread and task distribution among the cores. Probably with MPI_Init, mpirun creates the correct distribution among the physical cores.&lt;/P&gt;

&lt;PRE class="brush:plain;"&gt;[0] MPI startup(): Intel(R) MPI Library, Version 2017 Update 1  Build 20161016
[0] MPI startup(): Copyright (C) 2003-2016 Intel Corporation.  All rights reserved.
[0] MPI startup(): Multi-threaded optimized library
[2] MPI startup(): shm data transfer mode
[3] MPI startup(): shm data transfer mode
[0] MPI startup(): shm data transfer mode
[1] MPI startup(): shm data transfer mode
[0] MPI startup(): Internal info: pinning initialization was done
[1] MPI startup(): Internal info: pinning initialization was done
[2] MPI startup(): Internal info: pinning initialization was done
[3] MPI startup(): Internal info: pinning initialization was done
[0] MPI startup(): Device_reset_idx=8
[0] MPI startup(): Allgather: 3: 0-0 &amp;amp; 0-2147483647
[0] MPI startup(): Allgather: 1: 1-337 &amp;amp; 0-2147483647
[0] MPI startup(): Allgather: 5: 338-877 &amp;amp; 0-2147483647
[0] MPI startup(): Allgather: 1: 878-2559 &amp;amp; 0-2147483647
[0] MPI startup(): Allgather: 5: 2560-6359 &amp;amp; 0-2147483647
[0] MPI startup(): Allgather: 1: 6360-13712 &amp;amp; 0-2147483647
[0] MPI startup(): Allgather: 3: 13713-32768 &amp;amp; 0-2147483647
[0] MPI startup(): Allgather: 1: 32769-2150688 &amp;amp; 0-2147483647
[0] MPI startup(): Allgather: 5: 0-2147483647 &amp;amp; 0-2147483647
[0] MPI startup(): Allgatherv: 1: 0-2048 &amp;amp; 0-2147483647
[0] MPI startup(): Allgatherv: 3: 0-2147483647 &amp;amp; 0-2147483647
[0] MPI startup(): Allreduce: 7: 0-0 &amp;amp; 0-2147483647
[0] MPI startup(): Allreduce: 12: 1-5 &amp;amp; 0-2147483647
[0] MPI startup(): Allreduce: 11: 6-11 &amp;amp; 0-2147483647
[0] MPI startup(): Allreduce: 10: 12-26 &amp;amp; 0-2147483647
[0] MPI startup(): Allreduce: 11: 27-76 &amp;amp; 0-2147483647
[0] MPI startup(): Allreduce: 10: 77-256 &amp;amp; 0-2147483647
[0] MPI startup(): Allreduce: 11: 257-658 &amp;amp; 0-2147483647
[0] MPI startup(): Allreduce: 10: 659-1815 &amp;amp; 0-2147483647
[0] MPI startup(): Allreduce: 1: 1816-7645 &amp;amp; 0-2147483647
[0] MPI startup(): Allreduce: 7: 7646-16384 &amp;amp; 0-2147483647
[0] MPI startup(): Allreduce: 2: 16385-47972 &amp;amp; 0-2147483647
[0] MPI startup(): Allreduce: 7: 47973-106959 &amp;amp; 0-2147483647
[0] MPI startup(): Allreduce: 2: 106960-316569 &amp;amp; 0-2147483647
[0] MPI startup(): Allreduce: 7: 316570-590413 &amp;amp; 0-2147483647
[0] MPI startup(): Allreduce: 2: 590414-1473386 &amp;amp; 0-2147483647
[0] MPI startup(): Allreduce: 7: 0-2147483647 &amp;amp; 0-2147483647
[0] MPI startup(): Alltoall: 1: 0-0 &amp;amp; 0-2147483647
[0] MPI startup(): Alltoall: 3: 1-1 &amp;amp; 0-2147483647
[0] MPI startup(): Alltoall: 4: 2-2 &amp;amp; 0-2147483647
[0] MPI startup(): Alltoall: 3: 3-16 &amp;amp; 0-2147483647
[0] MPI startup(): Alltoall: 4: 17-32 &amp;amp; 0-2147483647
[0] MPI startup(): Alltoall: 3: 33-4096 &amp;amp; 0-2147483647
[0] MPI startup(): Alltoall: 2: 4097-8192 &amp;amp; 0-2147483647
[0] MPI startup(): Alltoall: 3: 8193-16384 &amp;amp; 0-2147483647
[0] MPI startup(): Alltoall: 2: 0-2147483647 &amp;amp; 0-2147483647
[0] MPI startup(): Alltoallv: 1: 0-2147483647 &amp;amp; 0-2147483647
[0] MPI startup(): Alltoallw: 0: 0-2147483647 &amp;amp; 0-2147483647
[0] MPI startup(): Barrier: 7: 0-2147483647 &amp;amp; 0-2147483647
[0] MPI startup(): Bcast: 3: 0-0 &amp;amp; 0-2147483647
[0] MPI startup(): Bcast: 9: 1-3290 &amp;amp; 0-2147483647
[0] MPI startup(): Bcast: 8: 3291-11498 &amp;amp; 0-2147483647
[0] MPI startup(): Bcast: 1: 11499-48691 &amp;amp; 0-2147483647
[0] MPI startup(): Bcast: 7: 48692-524288 &amp;amp; 0-2147483647
[0] MPI startup(): Bcast: 0: 524289-2097152 &amp;amp; 0-2147483647
[0] MPI startup(): Bcast: 7: 0-2147483647 &amp;amp; 0-2147483647
[0] MPI startup(): Exscan: 0: 0-2147483647 &amp;amp; 0-2147483647
[0] MPI startup(): Gather: 3: 0-2147483647 &amp;amp; 0-2147483647
[0] MPI startup(): Gatherv: 1: 0-2147483647 &amp;amp; 0-2147483647
[0] MPI startup(): Reduce_scatter: 4: 0-0 &amp;amp; 0-2147483647
[0] MPI startup(): Reduce_scatter: 1: 1-15 &amp;amp; 0-2147483647
[0] MPI startup(): Reduce_scatter: 3: 16-65536 &amp;amp; 0-2147483647
[0] MPI startup(): Reduce_scatter: 2: 0-2147483647 &amp;amp; 0-2147483647
[0] MPI startup(): Reduce: 5: 0-0 &amp;amp; 0-2147483647
[0] MPI startup(): Reduce: 10: 1-44 &amp;amp; 0-2147483647
[0] MPI startup(): Reduce: 8: 45-87 &amp;amp; 0-2147483647
[0] MPI startup(): Reduce: 10: 88-230 &amp;amp; 0-2147483647
[0] MPI startup(): Reduce: 9: 231-563 &amp;amp; 0-2147483647
[0] MPI startup(): Reduce: 10: 564-2153 &amp;amp; 0-2147483647
[0] MPI startup(): Reduce: 8: 2154-6802 &amp;amp; 0-2147483647
[0] MPI startup(): Reduce: 10: 6803-8701 &amp;amp; 0-2147483647
[0] MPI startup(): Reduce: 5: 8702-74514 &amp;amp; 0-2147483647
[0] MPI startup(): Reduce: 1: 0-2147483647 &amp;amp; 0-2147483647
[0] MPI startup(): Scan: 0: 0-2147483647 &amp;amp; 0-2147483647
[0] MPI startup(): Scatter: 1: 0-1785 &amp;amp; 0-2147483647
[0] MPI startup(): Scatter: 3: 1786-21541 &amp;amp; 0-2147483647
[0] MPI startup(): Scatter: 1: 21542-50077 &amp;amp; 0-2147483647
[0] MPI startup(): Scatter: 3: 0-2147483647 &amp;amp; 0-2147483647
[0] MPI startup(): Scatterv: 0: 0-2147483647 &amp;amp; 0-2147483647
[1] MPI startup(): Recognition=2 Platform(code=128 ippn=2 dev=1) Fabric(intra=1 inter=1 flags=0x0)
[0] MPI startup(): Rank    Pid      Node name  Pin cpu
[0] MPI startup(): 0       13972    W2K12R2RT  {0,1,2,3,4,5,6,7}
[0] MPI startup(): 1       9524     W2K12R2RT  {8,9,10,11,12,13,14,15}
[0] MPI startup(): 2       14128    W2K12R2RT  {16,17,18,19,20,21,22,23}
[0] MPI startup(): 3       13388    W2K12R2RT  {24,25,26,27,28,29,30,31}
[0] MPI startup(): Recognition=2 Platform(code=128 ippn=2 dev=1) Fabric(intra=1 inter=1 flags=0x0)
[2] MPI startup(): Recognition=2 Platform(code=128 ippn=2 dev=1) Fabric(intra=1 inter=1 flags=0x0)
[3] MPI startup(): Recognition=2 Platform(code=128 ippn=2 dev=1) Fabric(intra=1 inter=1 flags=0x0)
[0] MPI startup(): I_MPI_DEBUG=7
[0] MPI startup(): I_MPI_PIN_MAPPING=4:0 0,1 8,2 16,3 24
[0] MPI startup(): OMP_NUM_THREADS=4

HT vers 0.1
OMP THREADS = 4 
TIME TO SOLUTION (sec)  = 2.840791e-01 
END HT.xx

HT vers 0.1
OMP THREADS = 4  
TIME TO SOLUTION (sec)  = 2.847850e-01 
END HT.xx

HT vers 0.1
OMP THREADS = 4 
TIME TO SOLUTION (sec)  = 2.836015e-01 
END HT.xx

HT vers 0.1
OMP THREADS = 4 
TIME TO SOLUTION (sec)  = 2.823817e-01 
END HT.xx&lt;/PRE&gt;

&lt;P&gt;and with only 1 MPI task I obtain:&lt;/P&gt;

&lt;PRE class="brush:plain;" style="font-size: 13.008px;"&gt;HT vers 0.1
OMP THREADS = 4 
TIME TO SOLUTION (sec)  = 2.423817e-01 
END HT.xx&lt;/PRE&gt;

&lt;P&gt;So, with the MPI library included in the software, the Windows results are consistent with the Linux results. I think there is probably a configuration problem on the Windows workstation.&lt;/P&gt;

&lt;P&gt;Thanks again&lt;/P&gt;

&lt;P&gt;Eric&lt;/P&gt;</description>
      <pubDate>Thu, 29 Dec 2016 10:07:33 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/OpenMP-MPI-performance-degradation-on-windows/m-p/1110091#M7441</guid>
      <dc:creator>eric_p_</dc:creator>
      <dc:date>2016-12-29T10:07:33Z</dc:date>
    </item>
  </channel>
</rss>

