topic Eric, in Intel® Moderncode for Parallel Architectures

OpenMP-MPI performance degradation on Windows

eric_p_ — Fri, 09 Dec 2016 21:43:04 GMT

Hi all, and thanks for your help.

I have developed a high-throughput OpenMP application in C called ht.xx. It must be multiplatform (Linux and Windows), but its performance on the two operating systems is very different. On both I use Intel Parallel Studio Cluster Edition 2017. The Linux workstation has 36 cores (dual-socket E5-2697 v4 @ 2.30 GHz, CentOS 7.2), while the Windows workstation has 32 cores (dual-socket E5-2697A v4 @ 2.60 GHz, Windows Server 2012 R2).

I launch my application on both systems as:

mpirun -np num_mpitask ./ht.xx input_file num_ompthreads

So the application takes two parameters: the first is its input file (a list of jobs) and the second is the number of OpenMP threads for each MPI task.

Performance on Windows and Linux is equal when num_mpitask is 1 (time to solution = 300 ms), but with more MPI tasks I get these results:

On Linux, increasing the number of MPI tasks degrades performance (e.g. with 4 MPI tasks and num_ompthreads=4, each task completes in 360 ms, i.e. 60 ms slower).
On Windows, increasing the number of MPI tasks degrades performance badly (e.g. with 4 MPI tasks and num_ompthreads=4, each task completes in 640 ms, i.e. 340 ms slower).

On the Windows workstation I have tried changing the MPI task binding and the OpenMP thread affinity, but these trials had no effect.

Can someone help me explain this behaviour? I don't understand why, with multiple MPI tasks, I lose 60 ms on Linux while on Windows the time to solution doubles.

Thanks for your attention.

Best regards

Eric

jimdempseyatthecove — Fri, 09 Dec 2016 22:41:21 GMT

Are you linking with MKL? If so, which configuration of MKL (single thread, or multi-thread)?

Note, typically you should use the single thread MKL when using multi-thread OpenMP. And conversely
use the multi-thread MKL when not using OpenMP within a process.

Not following the above rule typically results in oversubscription, however, as you have configured using -np 4
on your Linux system (36 cores), each process should have 36/4 (9) cores per process, clearly in excess of 4.

What may be happening is the thread pinning of each of the 4 processes is the same.

OR...

You may have Hyper-Threading enabled on the Windows system, with the affinity pinning distributing threads not among cores but among hardware threads within a core.

Try adding "verbose" to your KMP_AFFINITY environment variable.
Note: each process on the same system should be assigned its own KMP_AFFINITY via an argument passed on your mpirun command line (to avoid running on the same logical processors).

Jim Dempsey

eric_p_ — Tue, 13 Dec 2016 19:56:37 GMT

Hi Jim, thanks for your reply.

No, I don't use MKL, and Hyper-Threading is disabled on the Windows workstation.

I tried changing the affinity and the MPI task binding, but my trials had no effect. The example below (with verbose mode enabled) uses 4 MPI tasks and 4 OpenMP threads per task:

mpiexec -binding map=0,5,10,15 -genv OMP_DISPLAY_ENV true -genv OMP_NUM_THREADS 4  -genv KMP_AFFINITY=verbose,compact -genv OMP_PROC_BIND close -genv OMP_PLACES cores -np 4 ht.exe input.txt 4

OMP: Warning #181: OMP_PLACES: ignored because KMP_AFFINITY has been defined
OMP: Warning #181: OMP_PROC_BIND: ignored because KMP_AFFINITY has been defined

OPENMP DISPLAY ENVIRONMENT BEGIN
   _OPENMP='201511'
  [host] OMP_CANCELLATION='FALSE'
  [host] OMP_DEFAULT_DEVICE='0'
  [host] OMP_DISPLAY_ENV='TRUE'
  [host] OMP_DYNAMIC='FALSE'
  [host] OMP_MAX_ACTIVE_LEVELS='2147483647'
  [host] OMP_MAX_TASK_PRIORITY='0'
  [host] OMP_NESTED='FALSE'
  [host] OMP_NUM_THREADS='4'
  [host] OMP_PLACES='cores'
  [host] OMP_PROC_BIND='intel'
  [host] OMP_SCHEDULE='static'
  [host] OMP_STACKSIZE='4M'
  [host] OMP_THREAD_LIMIT='2147483647'
  [host] OMP_WAIT_POLICY='PASSIVE'
OPENMP DISPLAY ENVIRONMENT END


OMP: Info #204: KMP_AFFINITY: decoding x2APIC ids.
OMP: Warning #181: OMP_PLACES: ignored because KMP_AFFINITY has been defined
OMP: Warning #181: OMP_PROC_BIND: ignored because KMP_AFFINITY has been defined

OPENMP DISPLAY ENVIRONMENT BEGIN
   _OPENMP='201511'
  [host] OMP_CANCELLATION='FALSE'
  [host] OMP_DEFAULT_DEVICE='0'
  [host] OMP_DISPLAY_ENV='TRUE'
  [host] OMP_DYNAMIC='FALSE'
  [host] OMP_MAX_ACTIVE_LEVELS='2147483647'
  [host] OMP_MAX_TASK_PRIORITY='0'
  [host] OMP_NESTED='FALSE'
  [host] OMP_NUM_THREADS='4'
  [host] OMP_PLACES='cores'
  [host] OMP_PROC_BIND='intel'
  [host] OMP_SCHEDULE='static'
  [host] OMP_STACKSIZE='4M'
  [host] OMP_THREAD_LIMIT='2147483647'
  [host] OMP_WAIT_POLICY='PASSIVE'
OPENMP DISPLAY ENVIRONMENT END


OMP: Info #204: KMP_AFFINITY: decoding x2APIC ids.
OMP: Warning #181: OMP_PLACES: ignored because KMP_AFFINITY has been defined
OMP: Warning #181: OMP_PROC_BIND: ignored because KMP_AFFINITY has been defined

OPENMP DISPLAY ENVIRONMENT BEGIN
   _OPENMP='201511'
  [host] OMP_CANCELLATION='FALSE'
  [host] OMP_DEFAULT_DEVICE='0'
  [host] OMP_DISPLAY_ENV='TRUE'
  [host] OMP_DYNAMIC='FALSE'
  [host] OMP_MAX_ACTIVE_LEVELS='2147483647'
  [host] OMP_MAX_TASK_PRIORITY='0'
  [host] OMP_NESTED='FALSE'
  [host] OMP_NUM_THREADS='4'
  [host] OMP_PLACES='cores'
  [host] OMP_PROC_BIND='intel'
  [host] OMP_SCHEDULE='static'
  [host] OMP_STACKSIZE='4M'
  [host] OMP_THREAD_LIMIT='2147483647'
  [host] OMP_WAIT_POLICY='PASSIVE'
OPENMP DISPLAY ENVIRONMENT END


OMP: Info #204: KMP_AFFINITY: decoding x2APIC ids.
OMP: Warning #181: OMP_PLACES: ignored because KMP_AFFINITY has been defined
OMP: Warning #181: OMP_PROC_BIND: ignored because KMP_AFFINITY has been defined

OPENMP DISPLAY ENVIRONMENT BEGIN
   _OPENMP='201511'
  [host] OMP_CANCELLATION='FALSE'
  [host] OMP_DEFAULT_DEVICE='0'
  [host] OMP_DISPLAY_ENV='TRUE'
  [host] OMP_DYNAMIC='FALSE'
  [host] OMP_MAX_ACTIVE_LEVELS='2147483647'
  [host] OMP_MAX_TASK_PRIORITY='0'
  [host] OMP_NESTED='FALSE'
  [host] OMP_NUM_THREADS='4'
  [host] OMP_PLACES='cores'
  [host] OMP_PROC_BIND='intel'
  [host] OMP_SCHEDULE='static'
  [host] OMP_STACKSIZE='4M'
  [host] OMP_THREAD_LIMIT='2147483647'
  [host] OMP_WAIT_POLICY='PASSIVE'
OPENMP DISPLAY ENVIRONMENT END


OMP: Info #204: KMP_AFFINITY: decoding x2APIC ids.
OMP: Info #202: KMP_AFFINITY: Affinity capable, using global cpuid leaf 11 info
OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31}
OMP: Info #156: KMP_AFFINITY: 32 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #179: KMP_AFFINITY: 2 packages x 16 cores/pkg x 1 threads/core (32 total cores)
OMP: Info #206: KMP_AFFINITY: OS proc to physical thread map:
OMP: Info #171: KMP_AFFINITY: OS proc 0 maps to package 0 core 0 
OMP: Info #171: KMP_AFFINITY: OS proc 1 maps to package 0 core 1 
OMP: Info #171: KMP_AFFINITY: OS proc 2 maps to package 0 core 2 
OMP: Info #171: KMP_AFFINITY: OS proc 3 maps to package 0 core 3 
OMP: Info #171: KMP_AFFINITY: OS proc 4 maps to package 0 core 4 
OMP: Info #171: KMP_AFFINITY: OS proc 5 maps to package 0 core 5 
OMP: Info #171: KMP_AFFINITY: OS proc 6 maps to package 0 core 6 
OMP: Info #171: KMP_AFFINITY: OS proc 7 maps to package 0 core 7 
OMP: Info #171: KMP_AFFINITY: OS proc 8 maps to package 0 core 8 
OMP: Info #171: KMP_AFFINITY: OS proc 9 maps to package 0 core 9 
OMP: Info #171: KMP_AFFINITY: OS proc 10 maps to package 0 core 10 
OMP: Info #171: KMP_AFFINITY: OS proc 11 maps to package 0 core 11 
OMP: Info #171: KMP_AFFINITY: OS proc 12 maps to package 0 core 12 
OMP: Info #171: KMP_AFFINITY: OS proc 13 maps to package 0 core 13 
OMP: Info #171: KMP_AFFINITY: OS proc 14 maps to package 0 core 14 
OMP: Info #171: KMP_AFFINITY: OS proc 15 maps to package 0 core 15 
OMP: Info #171: KMP_AFFINITY: OS proc 16 maps to package 1 core 0 
OMP: Info #171: KMP_AFFINITY: OS proc 17 maps to package 1 core 1 
OMP: Info #171: KMP_AFFINITY: OS proc 18 maps to package 1 core 2 
OMP: Info #171: KMP_AFFINITY: OS proc 19 maps to package 1 core 3 
OMP: Info #171: KMP_AFFINITY: OS proc 20 maps to package 1 core 4 
OMP: Info #171: KMP_AFFINITY: OS proc 21 maps to package 1 core 5 
OMP: Info #171: KMP_AFFINITY: OS proc 22 maps to package 1 core 6 
OMP: Info #171: KMP_AFFINITY: OS proc 23 maps to package 1 core 7 
OMP: Info #171: KMP_AFFINITY: OS proc 24 maps to package 1 core 8 
OMP: Info #171: KMP_AFFINITY: OS proc 25 maps to package 1 core 9 
OMP: Info #171: KMP_AFFINITY: OS proc 26 maps to package 1 core 10 
OMP: Info #171: KMP_AFFINITY: OS proc 27 maps to package 1 core 11 
OMP: Info #171: KMP_AFFINITY: OS proc 28 maps to package 1 core 12 
OMP: Info #171: KMP_AFFINITY: OS proc 29 maps to package 1 core 13 
OMP: Info #171: KMP_AFFINITY: OS proc 30 maps to package 1 core 14 
OMP: Info #171: KMP_AFFINITY: OS proc 31 maps to package 1 core 15 
OMP: Info #242: KMP_AFFINITY: pid 11656 thread 0 bound to OS proc set {0}
OMP: Info #202: KMP_AFFINITY: Affinity capable, using global cpuid leaf 11 info
OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31}
OMP: Info #156: KMP_AFFINITY: 32 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #179: KMP_AFFINITY: 2 packages x 16 cores/pkg x 1 threads/core (32 total cores)
OMP: Info #206: KMP_AFFINITY: OS proc to physical thread map:
OMP: Info #171: KMP_AFFINITY: OS proc 0 maps to package 0 core 0 
OMP: Info #171: KMP_AFFINITY: OS proc 1 maps to package 0 core 1 
OMP: Info #171: KMP_AFFINITY: OS proc 2 maps to package 0 core 2 
OMP: Info #171: KMP_AFFINITY: OS proc 3 maps to package 0 core 3 
OMP: Info #171: KMP_AFFINITY: OS proc 4 maps to package 0 core 4 
OMP: Info #171: KMP_AFFINITY: OS proc 5 maps to package 0 core 5 
OMP: Info #171: KMP_AFFINITY: OS proc 6 maps to package 0 core 6 
OMP: Info #171: KMP_AFFINITY: OS proc 7 maps to package 0 core 7 
OMP: Info #171: KMP_AFFINITY: OS proc 8 maps to package 0 core 8 
OMP: Info #171: KMP_AFFINITY: OS proc 9 maps to package 0 core 9 
OMP: Info #242: KMP_AFFINITY: pid 11656 thread 1 bound to OS proc set {1}
OMP: Info #171: KMP_AFFINITY: OS proc 10 maps to package 0 core 10 
OMP: Info #171: KMP_AFFINITY: OS proc 11 maps to package 0 core 11 
OMP: Info #171: KMP_AFFINITY: OS proc 12 maps to package 0 core 12 
OMP: Info #171: KMP_AFFINITY: OS proc 13 maps to package 0 core 13 
OMP: Info #171: KMP_AFFINITY: OS proc 14 maps to package 0 core 14 
OMP: Info #171: KMP_AFFINITY: OS proc 15 maps to package 0 core 15 
OMP: Info #171: KMP_AFFINITY: OS proc 16 maps to package 1 core 0 
OMP: Info #242: KMP_AFFINITY: pid 11656 thread 2 bound to OS proc set {2}
OMP: Info #171: KMP_AFFINITY: OS proc 17 maps to package 1 core 1 
OMP: Info #171: KMP_AFFINITY: OS proc 18 maps to package 1 core 2 
OMP: Info #171: KMP_AFFINITY: OS proc 19 maps to package 1 core 3 
OMP: Info #171: KMP_AFFINITY: OS proc 20 maps to package 1 core 4 
OMP: Info #171: KMP_AFFINITY: OS proc 21 maps to package 1 core 5 
OMP: Info #171: KMP_AFFINITY: OS proc 22 maps to package 1 core 6 
OMP: Info #242: KMP_AFFINITY: pid 11656 thread 3 bound to OS proc set {3}
OMP: Info #171: KMP_AFFINITY: OS proc 23 maps to package 1 core 7 
OMP: Info #171: KMP_AFFINITY: OS proc 24 maps to package 1 core 8 
OMP: Info #171: KMP_AFFINITY: OS proc 25 maps to package 1 core 9 
OMP: Info #171: KMP_AFFINITY: OS proc 26 maps to package 1 core 10 
OMP: Info #171: KMP_AFFINITY: OS proc 27 maps to package 1 core 11 
OMP: Info #171: KMP_AFFINITY: OS proc 28 maps to package 1 core 12 
OMP: Info #171: KMP_AFFINITY: OS proc 29 maps to package 1 core 13 
OMP: Info #171: KMP_AFFINITY: OS proc 30 maps to package 1 core 14 
OMP: Info #171: KMP_AFFINITY: OS proc 31 maps to package 1 core 15 
OMP: Info #242: KMP_AFFINITY: pid 8976 thread 0 bound to OS proc set {0}
OMP: Info #242: KMP_AFFINITY: pid 8976 thread 1 bound to OS proc set {1}
OMP: Info #242: KMP_AFFINITY: pid 8976 thread 2 bound to OS proc set {2}
OMP: Info #242: KMP_AFFINITY: pid 8976 thread 3 bound to OS proc set {3}
OMP: Info #202: KMP_AFFINITY: Affinity capable, using global cpuid leaf 11 info
OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31}
OMP: Info #156: KMP_AFFINITY: 32 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #179: KMP_AFFINITY: 2 packages x 16 cores/pkg x 1 threads/core (32 total cores)
OMP: Info #206: KMP_AFFINITY: OS proc to physical thread map:
OMP: Info #171: KMP_AFFINITY: OS proc 0 maps to package 0 core 0 
OMP: Info #171: KMP_AFFINITY: OS proc 1 maps to package 0 core 1 
OMP: Info #171: KMP_AFFINITY: OS proc 2 maps to package 0 core 2 
OMP: Info #171: KMP_AFFINITY: OS proc 3 maps to package 0 core 3 
OMP: Info #171: KMP_AFFINITY: OS proc 4 maps to package 0 core 4 
OMP: Info #171: KMP_AFFINITY: OS proc 5 maps to package 0 core 5 
OMP: Info #171: KMP_AFFINITY: OS proc 6 maps to package 0 core 6 
OMP: Info #171: KMP_AFFINITY: OS proc 7 maps to package 0 core 7 
OMP: Info #171: KMP_AFFINITY: OS proc 8 maps to package 0 core 8 
OMP: Info #171: KMP_AFFINITY: OS proc 9 maps to package 0 core 9 
OMP: Info #171: KMP_AFFINITY: OS proc 10 maps to package 0 core 10 
OMP: Info #171: KMP_AFFINITY: OS proc 11 maps to package 0 core 11 
OMP: Info #171: KMP_AFFINITY: OS proc 12 maps to package 0 core 12 
OMP: Info #171: KMP_AFFINITY: OS proc 13 maps to package 0 core 13 
OMP: Info #171: KMP_AFFINITY: OS proc 14 maps to package 0 core 14 
OMP: Info #171: KMP_AFFINITY: OS proc 15 maps to package 0 core 15 
OMP: Info #171: KMP_AFFINITY: OS proc 16 maps to package 1 core 0 
OMP: Info #171: KMP_AFFINITY: OS proc 17 maps to package 1 core 1 
OMP: Info #171: KMP_AFFINITY: OS proc 18 maps to package 1 core 2 
OMP: Info #171: KMP_AFFINITY: OS proc 19 maps to package 1 core 3 
OMP: Info #171: KMP_AFFINITY: OS proc 20 maps to package 1 core 4 
OMP: Info #171: KMP_AFFINITY: OS proc 21 maps to package 1 core 5 
OMP: Info #171: KMP_AFFINITY: OS proc 22 maps to package 1 core 6 
OMP: Info #171: KMP_AFFINITY: OS proc 23 maps to package 1 core 7 
OMP: Info #171: KMP_AFFINITY: OS proc 24 maps to package 1 core 8 
OMP: Info #171: KMP_AFFINITY: OS proc 25 maps to package 1 core 9 
OMP: Info #171: KMP_AFFINITY: OS proc 26 maps to package 1 core 10 
OMP: Info #171: KMP_AFFINITY: OS proc 27 maps to package 1 core 11 
OMP: Info #171: KMP_AFFINITY: OS proc 28 maps to package 1 core 12 
OMP: Info #171: KMP_AFFINITY: OS proc 29 maps to package 1 core 13 
OMP: Info #171: KMP_AFFINITY: OS proc 30 maps to package 1 core 14 
OMP: Info #171: KMP_AFFINITY: OS proc 31 maps to package 1 core 15 
OMP: Info #242: KMP_AFFINITY: pid 12472 thread 0 bound to OS proc set {0}
OMP: Info #202: KMP_AFFINITY: Affinity capable, using global cpuid leaf 11 info
OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31}
OMP: Info #156: KMP_AFFINITY: 32 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #179: KMP_AFFINITY: 2 packages x 16 cores/pkg x 1 threads/core (32 total cores)
OMP: Info #206: KMP_AFFINITY: OS proc to physical thread map:
OMP: Info #171: KMP_AFFINITY: OS proc 0 maps to package 0 core 0 
OMP: Info #171: KMP_AFFINITY: OS proc 1 maps to package 0 core 1 
OMP: Info #171: KMP_AFFINITY: OS proc 2 maps to package 0 core 2 
OMP: Info #171: KMP_AFFINITY: OS proc 3 maps to package 0 core 3 
OMP: Info #171: KMP_AFFINITY: OS proc 4 maps to package 0 core 4 
OMP: Info #171: KMP_AFFINITY: OS proc 5 maps to package 0 core 5 
OMP: Info #171: KMP_AFFINITY: OS proc 6 maps to package 0 core 6 
OMP: Info #171: KMP_AFFINITY: OS proc 7 maps to package 0 core 7 
OMP: Info #171: KMP_AFFINITY: OS proc 8 maps to package 0 core 8 
OMP: Info #171: KMP_AFFINITY: OS proc 9 maps to package 0 core 9 
OMP: Info #171: KMP_AFFINITY: OS proc 10 maps to package 0 core 10 
OMP: Info #171: KMP_AFFINITY: OS proc 11 maps to package 0 core 11 
OMP: Info #171: KMP_AFFINITY: OS proc 12 maps to package 0 core 12 
OMP: Info #171: KMP_AFFINITY: OS proc 13 maps to package 0 core 13 
OMP: Info #171: KMP_AFFINITY: OS proc 14 maps to package 0 core 14 
OMP: Info #171: KMP_AFFINITY: OS proc 15 maps to package 0 core 15 
OMP: Info #171: KMP_AFFINITY: OS proc 16 maps to package 1 core 0 
OMP: Info #171: KMP_AFFINITY: OS proc 17 maps to package 1 core 1 
OMP: Info #171: KMP_AFFINITY: OS proc 18 maps to package 1 core 2 
OMP: Info #171: KMP_AFFINITY: OS proc 19 maps to package 1 core 3 
OMP: Info #171: KMP_AFFINITY: OS proc 20 maps to package 1 core 4 
OMP: Info #171: KMP_AFFINITY: OS proc 21 maps to package 1 core 5 
OMP: Info #171: KMP_AFFINITY: OS proc 22 maps to package 1 core 6 
OMP: Info #171: KMP_AFFINITY: OS proc 23 maps to package 1 core 7 
OMP: Info #171: KMP_AFFINITY: OS proc 24 maps to package 1 core 8 
OMP: Info #171: KMP_AFFINITY: OS proc 25 maps to package 1 core 9 
OMP: Info #171: KMP_AFFINITY: OS proc 26 maps to package 1 core 10 
OMP: Info #171: KMP_AFFINITY: OS proc 27 maps to package 1 core 11 
OMP: Info #171: KMP_AFFINITY: OS proc 28 maps to package 1 core 12 
OMP: Info #171: KMP_AFFINITY: OS proc 29 maps to package 1 core 13 
OMP: Info #171: KMP_AFFINITY: OS proc 30 maps to package 1 core 14 
OMP: Info #171: KMP_AFFINITY: OS proc 31 maps to package 1 core 15 
OMP: Info #242: KMP_AFFINITY: pid 11428 thread 0 bound to OS proc set {0}
OMP: Info #242: KMP_AFFINITY: pid 12472 thread 1 bound to OS proc set {1}
OMP: Info #242: KMP_AFFINITY: pid 12472 thread 2 bound to OS proc set {2}
OMP: Info #242: KMP_AFFINITY: pid 12472 thread 3 bound to OS proc set {3}
OMP: Info #242: KMP_AFFINITY: pid 11428 thread 1 bound to OS proc set {1}
OMP: Info #242: KMP_AFFINITY: pid 11428 thread 2 bound to OS proc set {2}
OMP: Info #242: KMP_AFFINITY: pid 11428 thread 3 bound to OS proc set {3}

HT vers 0.1
OMP THREADS = 4 
TIME TO SOLUTION (sec)  = 3.834136e-01 
END HT

HT vers 0.1
OMP THREADS = 4 
TIME TO SOLUTION (sec)  = 7.583328e-01 
END HT.xx

HT vers 0.1
OMP THREADS = 4 
TIME TO SOLUTION (sec)  = 7.535792e-01 
END HT.xx

HT vers 0.1
OMP THREADS = 4 
TIME TO SOLUTION (sec)  = 9.195819e-01 
END HT.xx

And if I run with only 1 MPI task I obtain:

HT vers 0.1
OMP THREADS = 4 
TIME TO SOLUTION (sec)  = 3.982187e-01 
END HT

What is wrong with my configuration?

Thanks

Eric

jimdempseyatthecove — Fri, 16 Dec 2016 15:58:19 GMT

Eric,

I am a noob on placement, but here is my suggestion:

# remove any environment "global" variables relating to affinity
unset OMP_DISPLAY_ENV
unset OMP_NUM_THREADS
unset KMP_AFFINITY
unset OMP_PLACES
unset OMP_PROC_BIND

echo Run with default bindings, no affinity
mpiexec -np 4 -genv OMP_NUM_THREADS 4 ht.exe input.txt 4

echo 1 hw thread per place
mpiexec -env OMP_PLACES "{0:1}:4:8" ht.exe input.txt 4 : -env OMP_PLACES "{2:1}:4:8" ht.exe input.txt 4 : -env OMP_PLACES "{4:1}:4:8" ht.exe input.txt 4 : -env OMP_PLACES "{6:1}:4:8" ht.exe input.txt 4

echo 2 hw threads per place
mpiexec -env OMP_PLACES "{0:2}:4:8" ht.exe input.txt 4 : -env OMP_PLACES "{2:2}:4:8" ht.exe input.txt 4 : -env OMP_PLACES "{4:2}:4:8" ht.exe input.txt 4 : -env OMP_PLACES "{6:2}:4:8" ht.exe input.txt 4

IOW remove any and all globally set environment variables that are related to affinity placement and number of threads.

Then, for each test run, specify the places that provide a desirable distribution.

*** Note: the above is specific to your system configuration. There is likely a generic mpiexec setting that will permit you to use "-np 4" and still distribute the affinities equally.

Note 2: the first mpiexec above, with defaults, should pin each process to a different subset of the hardware threads (total number of hardware threads / 4 per process). Each process will then run OpenMP with 4 software threads within its restricted affinity mask. You can experiment with adding KMP_AFFINITY variations to the first mpiexec line above; note, however, that OMP_PLACES and KMP_AFFINITY are mutually exclusive.

Jim Dempsey

eric_p_ — Thu, 29 Dec 2016 10:07:33 GMT

Hi Jim, thanks for your reply.

I don't know why, but the problem is related to the mpirun command. I added MPI_Init and MPI_Finalize to my software (and included mpi.h) so that I could use I_MPI_DEBUG to analyze how threads and tasks are distributed among the cores. Probably, once MPI_Init is called, mpirun creates the correct distribution among the physical cores.

[0] MPI startup(): Intel(R) MPI Library, Version 2017 Update 1  Build 20161016
[0] MPI startup(): Copyright (C) 2003-2016 Intel Corporation.  All rights reserved.
[0] MPI startup(): Multi-threaded optimized library
[2] MPI startup(): shm data transfer mode
[3] MPI startup(): shm data transfer mode
[0] MPI startup(): shm data transfer mode
[1] MPI startup(): shm data transfer mode
[0] MPI startup(): Internal info: pinning initialization was done
[1] MPI startup(): Internal info: pinning initialization was done
[2] MPI startup(): Internal info: pinning initialization was done
[3] MPI startup(): Internal info: pinning initialization was done
[0] MPI startup(): Device_reset_idx=8
[0] MPI startup(): Allgather: 3: 0-0 & 0-2147483647
[0] MPI startup(): Allgather: 1: 1-337 & 0-2147483647
[0] MPI startup(): Allgather: 5: 338-877 & 0-2147483647
[0] MPI startup(): Allgather: 1: 878-2559 & 0-2147483647
[0] MPI startup(): Allgather: 5: 2560-6359 & 0-2147483647
[0] MPI startup(): Allgather: 1: 6360-13712 & 0-2147483647
[0] MPI startup(): Allgather: 3: 13713-32768 & 0-2147483647
[0] MPI startup(): Allgather: 1: 32769-2150688 & 0-2147483647
[0] MPI startup(): Allgather: 5: 0-2147483647 & 0-2147483647
[0] MPI startup(): Allgatherv: 1: 0-2048 & 0-2147483647
[0] MPI startup(): Allgatherv: 3: 0-2147483647 & 0-2147483647
[0] MPI startup(): Allreduce: 7: 0-0 & 0-2147483647
[0] MPI startup(): Allreduce: 12: 1-5 & 0-2147483647
[0] MPI startup(): Allreduce: 11: 6-11 & 0-2147483647
[0] MPI startup(): Allreduce: 10: 12-26 & 0-2147483647
[0] MPI startup(): Allreduce: 11: 27-76 & 0-2147483647
[0] MPI startup(): Allreduce: 10: 77-256 & 0-2147483647
[0] MPI startup(): Allreduce: 11: 257-658 & 0-2147483647
[0] MPI startup(): Allreduce: 10: 659-1815 & 0-2147483647
[0] MPI startup(): Allreduce: 1: 1816-7645 & 0-2147483647
[0] MPI startup(): Allreduce: 7: 7646-16384 & 0-2147483647
[0] MPI startup(): Allreduce: 2: 16385-47972 & 0-2147483647
[0] MPI startup(): Allreduce: 7: 47973-106959 & 0-2147483647
[0] MPI startup(): Allreduce: 2: 106960-316569 & 0-2147483647
[0] MPI startup(): Allreduce: 7: 316570-590413 & 0-2147483647
[0] MPI startup(): Allreduce: 2: 590414-1473386 & 0-2147483647
[0] MPI startup(): Allreduce: 7: 0-2147483647 & 0-2147483647
[0] MPI startup(): Alltoall: 1: 0-0 & 0-2147483647
[0] MPI startup(): Alltoall: 3: 1-1 & 0-2147483647
[0] MPI startup(): Alltoall: 4: 2-2 & 0-2147483647
[0] MPI startup(): Alltoall: 3: 3-16 & 0-2147483647
[0] MPI startup(): Alltoall: 4: 17-32 & 0-2147483647
[0] MPI startup(): Alltoall: 3: 33-4096 & 0-2147483647
[0] MPI startup(): Alltoall: 2: 4097-8192 & 0-2147483647
[0] MPI startup(): Alltoall: 3: 8193-16384 & 0-2147483647
[0] MPI startup(): Alltoall: 2: 0-2147483647 & 0-2147483647
[0] MPI startup(): Alltoallv: 1: 0-2147483647 & 0-2147483647
[0] MPI startup(): Alltoallw: 0: 0-2147483647 & 0-2147483647
[0] MPI startup(): Barrier: 7: 0-2147483647 & 0-2147483647
[0] MPI startup(): Bcast: 3: 0-0 & 0-2147483647
[0] MPI startup(): Bcast: 9: 1-3290 & 0-2147483647
[0] MPI startup(): Bcast: 8: 3291-11498 & 0-2147483647
[0] MPI startup(): Bcast: 1: 11499-48691 & 0-2147483647
[0] MPI startup(): Bcast: 7: 48692-524288 & 0-2147483647
[0] MPI startup(): Bcast: 0: 524289-2097152 & 0-2147483647
[0] MPI startup(): Bcast: 7: 0-2147483647 & 0-2147483647
[0] MPI startup(): Exscan: 0: 0-2147483647 & 0-2147483647
[0] MPI startup(): Gather: 3: 0-2147483647 & 0-2147483647
[0] MPI startup(): Gatherv: 1: 0-2147483647 & 0-2147483647
[0] MPI startup(): Reduce_scatter: 4: 0-0 & 0-2147483647
[0] MPI startup(): Reduce_scatter: 1: 1-15 & 0-2147483647
[0] MPI startup(): Reduce_scatter: 3: 16-65536 & 0-2147483647
[0] MPI startup(): Reduce_scatter: 2: 0-2147483647 & 0-2147483647
[0] MPI startup(): Reduce: 5: 0-0 & 0-2147483647
[0] MPI startup(): Reduce: 10: 1-44 & 0-2147483647
[0] MPI startup(): Reduce: 8: 45-87 & 0-2147483647
[0] MPI startup(): Reduce: 10: 88-230 & 0-2147483647
[0] MPI startup(): Reduce: 9: 231-563 & 0-2147483647
[0] MPI startup(): Reduce: 10: 564-2153 & 0-2147483647
[0] MPI startup(): Reduce: 8: 2154-6802 & 0-2147483647
[0] MPI startup(): Reduce: 10: 6803-8701 & 0-2147483647
[0] MPI startup(): Reduce: 5: 8702-74514 & 0-2147483647
[0] MPI startup(): Reduce: 1: 0-2147483647 & 0-2147483647
[0] MPI startup(): Scan: 0: 0-2147483647 & 0-2147483647
[0] MPI startup(): Scatter: 1: 0-1785 & 0-2147483647
[0] MPI startup(): Scatter: 3: 1786-21541 & 0-2147483647
[0] MPI startup(): Scatter: 1: 21542-50077 & 0-2147483647
[0] MPI startup(): Scatter: 3: 0-2147483647 & 0-2147483647
[0] MPI startup(): Scatterv: 0: 0-2147483647 & 0-2147483647
[1] MPI startup(): Recognition=2 Platform(code=128 ippn=2 dev=1) Fabric(intra=1 inter=1 flags=0x0)
[0] MPI startup(): Rank    Pid      Node name  Pin cpu
[0] MPI startup(): 0       13972    W2K12R2RT  {0,1,2,3,4,5,6,7}
[0] MPI startup(): 1       9524     W2K12R2RT  {8,9,10,11,12,13,14,15}
[0] MPI startup(): 2       14128    W2K12R2RT  {16,17,18,19,20,21,22,23}
[0] MPI startup(): 3       13388    W2K12R2RT  {24,25,26,27,28,29,30,31}
[0] MPI startup(): Recognition=2 Platform(code=128 ippn=2 dev=1) Fabric(intra=1 inter=1 flags=0x0)
[2] MPI startup(): Recognition=2 Platform(code=128 ippn=2 dev=1) Fabric(intra=1 inter=1 flags=0x0)
[3] MPI startup(): Recognition=2 Platform(code=128 ippn=2 dev=1) Fabric(intra=1 inter=1 flags=0x0)
[0] MPI startup(): I_MPI_DEBUG=7
[0] MPI startup(): I_MPI_PIN_MAPPING=4:0 0,1 8,2 16,3 24
[0] MPI startup(): OMP_NUM_THREADS=4

HT vers 0.1
OMP THREADS = 4 
TIME TO SOLUTION (sec)  = 2.840791e-01 
END HT.xx

HT vers 0.1
OMP THREADS = 4  
TIME TO SOLUTION (sec)  = 2.847850e-01 
END HT.xx

HT vers 0.1
OMP THREADS = 4 
TIME TO SOLUTION (sec)  = 2.836015e-01 
END HT.xx

HT vers 0.1
OMP THREADS = 4 
TIME TO SOLUTION (sec)  = 2.823817e-01 
END HT.xx

and with only 1 MPI task I obtain:

HT vers 0.1
OMP THREADS = 4 
TIME TO SOLUTION (sec)  = 2.423817e-01 
END HT.xx

So, with the MPI library included in the software, the Windows results are comparable to the Linux results. I think there is probably a configuration problem on the Windows workstation.

Thanks again

Eric