Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

MPI+OPENMP does not speed up

Siwei_D_

Hi everyone

I need help with a problem that has been bothering me for several days.

When I use only OpenMP, the code works fine, the time for 4 threads is 70s and 136s for 8 threads.

However, when I use MPI+OpenMP, the code does not speed up, i.e. the run time is the same for 1 thread and for 8 threads!

I am using Intel Fortran. I compiled it this way: mpif90 -openmp -check all hybpi.f90 -o hybpi

I also uploaded my code. It is very simple and just computes the value of PI. I am also pasting the PBS script for MPI+OpenMP:

#!/bin/sh -e
#PBS -N thread8_hybpi
#PBS -e out.err
#PBS -o out.out
#PBS -l walltime=2:00:00,nodes=2:ppn=12:nogpu
#PBS -k oe

cd $PBS_O_WORKDIR
cat $PBS_NODEFILE > nodefile
cat $PBS_NODEFILE | uniq > ./mpd_nodefile_$USER
export NPROCS=`wc -l mpd_nodefile_$USER |gawk '//{print $1}'`
export OMP_NUM_THREADS=8
ulimit -s hard

WORKDIR=/home/siwei/module

cd $WORKDIR

MPIEXEC="/apps/mvapich2-1.7-r5123/build-intel/bin/mpirun"
mpiexec -machinefile mpd_nodefile_$USER -np $NPROCS /bin/env \
OMP_NUM_THREADS=8 ./hybpi

I hope someone can help me!

Thanks in advance!!!

Siwei

TimP

If you run multiple MPI processes per node, you probably need to arrange for each MPI process to get a distinct list of cores in KMP_AFFINITY.  If your application uses all cores effectively with MPI alone, it probably won't do any better with a combination of MPI and OpenMP.

If you have 12 cores per 2 CPU node, you should try:

1 process per node x 8 and 12 OpenMP threads

2 processes affinitized by CPU, 4 and 6 threads per

4 proc, 3 threads per

6 proc, 2 threads per

In my work on Westmere, the cases with 2 and 3 threads per MPI rank worked best.

It's nearly certain that having multiple MPI ranks competing to run their threads on the same cores will degrade performance.  If you had an application set up to benefit from HyperThreading, you would still need to assure that the correct pairs of threads from a single rank land on each core.
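
As a rough illustration of the layouts listed above, the following shell sketch enumerates every combination of MPI ranks per node and OpenMP threads per rank that fills a node exactly. The function name list_layouts and the CORES_PER_NODE variable are hypothetical; the default of 12 cores per node is taken from the discussion.

```shell
#!/bin/sh
# Enumerate (MPI ranks per node) x (OpenMP threads per rank) layouts
# that use every core of a node exactly once.
list_layouts() {  # usage: list_layouts CORES_PER_NODE
    cores=$1
    ranks=1
    while [ "$ranks" -le "$cores" ]; do
        # Keep only rank counts that divide the core count evenly.
        if [ $((cores % ranks)) -eq 0 ]; then
            echo "$ranks ranks/node x $((cores / ranks)) threads/rank"
        fi
        ranks=$((ranks + 1))
    done
}

# 12 matches the 2-socket, 6-cores-per-socket nodes discussed in this thread.
list_layouts "${CORES_PER_NODE:-12}"
```

Each printed line is a candidate to time; which one is fastest depends on the application, as the Westmere remark above suggests.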

Siwei_D_

Hi Tim

Thank you so much for such a fast reply!

In my cluster, one node has two sockets and each socket has 6 cores, so one node has 12 cores.

When I run hybrid MPI+OpenMP on 2 nodes with 8 threads, I just let 8 cores on each node work.

I don't know how to use the KMP_AFFINITY setting you mentioned.

Could you explain in detail, for example, how to set up the 4 processes with 3 threads each that you mentioned?

Siwei_D_

Hi Tim

I think I did not make it clear. I do not think multiple MPI ranks are competing to run their threads on the same cores.

In the PBS script I use 2 nodes (2 MPI tasks, i.e. 2 MPI ranks, one per node) with 8 threads on each node (one thread per core), so there is no competition for the same core.

TimP

If you are using just 1 MPI process per node, then it's fairly easy; simply set appropriate KMP_AFFINITY environment variable, same for each node, according to whether you have HT enabled.

If it's a Westmere, you may have to deal with its peculiarity; the optimum way to run 8 threads per node is with 1 thread per L3 connection, recognizing that the first 2 pairs of cores share cache connections.  But for a start, you must recognize that it's important to use the thread pinning of your OpenMP.  If you don't, your MPI job will be paced by the worst accidental thread placement of either node.   You might start by reading the documentation which comes with the compiler.
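
A minimal sketch of the "one MPI process per node" case described above, choosing a KMP_AFFINITY value according to whether HyperThreading is enabled. The helper name affinity_for and the HT variable are hypothetical, and the two values are only common starting points from the Intel OpenMP runtime documentation, not settings confirmed in this thread; the Westmere-specific placement mentioned above would need an explicit proclist instead.

```shell
#!/bin/sh
# Pick a starting KMP_AFFINITY value for one MPI process per node.
# $1 = "ht" if HyperThreading is enabled, anything else otherwise.
affinity_for() {
    case $1 in
        # With HT: pin each thread to one hardware thread and fill
        # distinct cores before reusing a core's second hardware thread.
        ht) echo "verbose,granularity=fine,compact,1,0" ;;
        # Without HT: one thread per core, packed in core order.
        *)  echo "verbose,granularity=core,compact" ;;
    esac
}

# HT is a hypothetical variable set by the user to "ht" or left unset.
KMP_AFFINITY=$(affinity_for "${HT:-off}")
export KMP_AFFINITY
echo "KMP_AFFINITY=$KMP_AFFINITY"
```

The same value has to reach every node, for example by passing it on the launch line the way OMP_NUM_THREADS is passed through /bin/env in the job script.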

Siwei_D_

Hi Tim 

Sorry, I do not understand affinity very well, and I do not understand the output below.

I am using: export KMP_AFFINITY=verbose,granularity=thread,proclist=[0,1,2,3],explicit

I am using 2 nodes; each node has 8 cores (without Hyper-Threading).

OMP: Info #204: KMP_AFFINITY: decoding x2APIC ids.

OMP: Warning #205: KMP_AFFINITY: cpuid leaf 11 not supported - decoding legacy APIC ids.
OMP: Info #149: KMP_AFFINITY: Affinity capable, using global cpuid info
OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: {0}
OMP: Info #156: KMP_AFFINITY: 1 available OS procs
OMP: Info #204: KMP_AFFINITY: decoding x2APIC ids.
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Warning #205: KMP_AFFINITY: cpuid leaf 11 not supported - decoding legacy APIC ids.
OMP: Info #159: KMP_AFFINITY: 1 packages x 1 cores/pkg x 1 threads/core (1 total cores)
OMP: Info #149: KMP_AFFINITY: Affinity capable, using global cpuid info
OMP: Info #206: KMP_AFFINITY: OS proc to physical thread map:
OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: {0}
OMP: Info #156: KMP_AFFINITY: 1 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #171: KMP_AFFINITY: OS proc 0 maps to package 0
OMP: Warning #123: Ignoring invalid OS proc ID 1.
OMP: Warning #123: Ignoring invalid OS proc ID 2.
OMP: Warning #123: Ignoring invalid OS proc ID 3.
OMP: Info #147: KMP_AFFINITY: Internal thread 0 bound to OS proc set {0}
OMP: Info #159: KMP_AFFINITY: 1 packages x 1 cores/pkg x 1 threads/core (1 total cores)
OMP: Info #206: KMP_AFFINITY: OS proc to physical thread map:
OMP: Info #171: KMP_AFFINITY: OS proc 0 maps to package 0
OMP: Warning #123: Ignoring invalid OS proc ID 1.
OMP: Warning #123: Ignoring invalid OS proc ID 2.
OMP: Warning #123: Ignoring invalid OS proc ID 3.
OMP: Info #147: KMP_AFFINITY: Internal thread 0 bound to OS proc set {0}
OMP: Info #147: KMP_AFFINITY: Internal thread 3 bound to OS proc set {0}
OMP: Info #147: KMP_AFFINITY: Internal thread 3 bound to OS proc set {0}
OMP: Info #147: KMP_AFFINITY: Internal thread 2 bound to OS proc set {0}
OMP: Info #147: KMP_AFFINITY: Internal thread 2 bound to OS proc set {0}
OMP: Info #147: KMP_AFFINITY: Internal thread 1 bound to OS proc set {0}
OMP: Info #147: KMP_AFFINITY: Internal thread 1 bound to OS proc set {0}

TimP

I'm not familiar enough with the looks of KMP_AFFINITY verbose on non-Intel CPUs (if that's what you have).

Is it correct that you have a single 8-core non-Intel CPU on each node?  If so, you may need to try the effect of various choices for using 4 of those cores, such as

KMP_AFFINITY="proclist=[1,3,5,7],explicit,verbose"

which, on an Intel CPU, ought to come out the same as

KMP_AFFINITY=scatter,1,1,verbose

In my experience, you always need the ""  around a proclist, presumably on account of the embedded punctuation.

A reason for trying the odd numbered cores might be that your system assigns interrupts to even numbered ones.  I have no idea if this might be true of AMD, or why you decline to identify yours.

Alternatively, if your CPU shares cache between even-odd core pairs, and your application doesn't need to be spread across all of cache, the choice you suggested may be best, if you get the syntax right.

If you assign 8 threads per node, but restrict them to 4 cores by an affinity setting such as you suggest, you will likely get worse results than without affinity setting.  proclist=[0-7] would request use of all 8 cores in order.
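
The oversubscription point above can be checked before a job is submitted. The sketch below counts the entries in the proclist part of a KMP_AFFINITY value and compares that count with the thread count. The function names are hypothetical, and the parser assumes the list contains only plain ids and lo-hi ranges.

```shell
#!/bin/sh
# Count the OS proc ids in the proclist=[...] part of a KMP_AFFINITY value.
# Assumption: the list holds only plain ids ("3") and ranges ("0-7").
proclist_size() {
    list=$(printf '%s\n' "$1" | sed -n 's/.*proclist=\[\([^]]*\)\].*/\1/p')
    total=0
    old_ifs=$IFS
    IFS=,
    for part in $list; do
        case $part in
            *-*) lo=${part%-*}; hi=${part#*-}
                 total=$((total + hi - lo + 1)) ;;
            *)   total=$((total + 1)) ;;
        esac
    done
    IFS=$old_ifs
    echo "$total"
}

# Compare a thread count with the number of cores named in the proclist.
check_threads() {  # usage: check_threads NUM_THREADS KMP_AFFINITY_VALUE
    cores=$(proclist_size "$2")
    if [ "$1" -gt "$cores" ]; then
        echo "oversubscribed: $1 threads on $cores cores"
    else
        echo "ok: $1 threads on $cores cores"
    fi
}

check_threads 8 "verbose,granularity=thread,proclist=[0,1,2,3],explicit"
check_threads 8 "proclist=[0-7],explicit,verbose"
```

The first call reports the 8-threads-on-4-cores case from this thread as oversubscribed; the second accepts proclist=[0-7].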

Siwei_D_

Hi Tim

I am sorry for troubling you again.

I read the output from the KMP_AFFINITY run above. The problem is that I can only see one core, which is really weird, as I have 8 cores per node.
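
The log lines "Initial OS proc set respected: {0}" and "1 available OS procs" suggest that each process already had an affinity mask of a single CPU when the OpenMP runtime started, which usually points at the MPI launcher or the batch system rather than at KMP_AFFINITY. A minimal sketch for checking this on Linux, independent of OpenMP, reads the Cpus_allowed_list field from /proc/self/status. The function name count_cpus is hypothetical.

```shell
#!/bin/sh
# Count the CPUs named in a Linux CPU list such as "0-3,8" or "7".
count_cpus() {
    total=0
    old_ifs=$IFS
    IFS=,
    for part in $1; do
        case $part in
            *-*) lo=${part%-*}; hi=${part#*-}
                 total=$((total + hi - lo + 1)) ;;
            *)   total=$((total + 1)) ;;
        esac
    done
    IFS=$old_ifs
    echo "$total"
}

# Report the affinity mask inherited from whatever launched this script
# (awk is a child of this shell, so it sees the same mask).
if [ -r /proc/self/status ]; then
    allowed=$(awk '/^Cpus_allowed_list:/ {print $2}' /proc/self/status)
    echo "host=$(uname -n) allowed_cpus=$allowed count=$(count_cpus "$allowed")"
fi
```

Running the script under the same mpiexec line as the real program, in place of ./hybpi, shows what each rank is given. A count of 1 would match the log above.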

Siwei_D_

It is an Intel(R) Xeon(R) CPU  E5462  @ 2.80GHz.

zhubq

I got the same problem when running through a PBS script, i.e. the number of processors available to the second MPI process is not 4 but 1.

#PBS -l nodes=1:ppn=8

export OMP_NUM_THREADS=4
export I_MPI_PIN_DOMAIN=omp

export KMP_AFFINITY=verbose

mpirun -np 2 ./a.out 

OMP: Info #204: KMP_AFFINITY: decoding x2APIC ids.
OMP: Info #204: KMP_AFFINITY: decoding x2APIC ids.
OMP: Info #202: KMP_AFFINITY: Affinity capable, using global cpuid leaf 11 info
OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: {4,5,6,7}
OMP: Info #156: KMP_AFFINITY: 4 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #179: KMP_AFFINITY: 2 packages x 2 cores/pkg x 1 threads/core (4 total cores)
OMP: Info #147: KMP_AFFINITY: Internal thread 0 bound to OS proc set {4,5,6,7}
OMP: Info #202: KMP_AFFINITY: Affinity capable, using global cpuid leaf 11 info
OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: {7}
OMP: Info #156: KMP_AFFINITY: 1 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #159: KMP_AFFINITY: 1 packages x 1 cores/pkg x 1 threads/core (1 total cores)
OMP: Info #147: KMP_AFFINITY: Internal thread 0 bound to OS proc set {7}
OMP: Info #147: KMP_AFFINITY: Internal thread 3 bound to OS proc set {7}
OMP: Info #147: KMP_AFFINITY: Internal thread 2 bound to OS proc set {7}
OMP: Info #147: KMP_AFFINITY: Internal thread 1 bound to OS proc set {7}
OMP: Info #147: KMP_AFFINITY: Internal thread 2 bound to OS proc set {4,5,6,7}
OMP: Info #147: KMP_AFFINITY: Internal thread 1 bound to OS proc set {4,5,6,7}
OMP: Info #147: KMP_AFFINITY: Internal thread 3 bound to OS proc set {4,5,6,7}

But if I do the same interactively, there is no such problem.

OMP: Info #204: KMP_AFFINITY: decoding x2APIC ids.
OMP: Info #204: KMP_AFFINITY: decoding x2APIC ids.
OMP: Info #202: KMP_AFFINITY: Affinity capable, using global cpuid leaf 11 info
OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: {0,1,12,13}
OMP: Info #156: KMP_AFFINITY: 4 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #179: KMP_AFFINITY: 1 packages x 2 cores/pkg x 2 threads/core (2 total cores)
OMP: Info #147: KMP_AFFINITY: Internal thread 0 bound to OS proc set {0,1,12,13}
OMP: Info #202: KMP_AFFINITY: Affinity capable, using global cpuid leaf 11 info
OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: {2,3,14,15}
OMP: Info #156: KMP_AFFINITY: 4 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #179: KMP_AFFINITY: 1 packages x 2 cores/pkg x 2 threads/core (2 total cores)
OMP: Info #147: KMP_AFFINITY: Internal thread 0 bound to OS proc set {2,3,14,15}
OMP: Info #147: KMP_AFFINITY: Internal thread 1 bound to OS proc set {0,1,12,13}
OMP: Info #147: KMP_AFFINITY: Internal thread 1 bound to OS proc set {2,3,14,15}
OMP: Info #147: KMP_AFFINITY: Internal thread 2 bound to OS proc set {0,1,12,13}
OMP: Info #147: KMP_AFFINITY: Internal thread 3 bound to OS proc set {0,1,12,13}
OMP: Info #147: KMP_AFFINITY: Internal thread 2 bound to OS proc set {2,3,14,15}

OMP: Info #147: KMP_AFFINITY: Internal thread 2 bound to OS proc set {2,3,14,15}
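
With several processes writing to the same stream, the verbose output is interleaved and hard to read. The sketch below pulls out only the "available OS procs" lines and flags any process that was given fewer OS procs than it has OpenMP threads. The function name flag_starved is hypothetical, and the numbering follows order of appearance in the log, not MPI rank.

```shell
#!/bin/sh
# Read KMP_AFFINITY=verbose output on stdin and report, for each
# "N available OS procs" line, whether N covers the thread count.
flag_starved() {  # usage: flag_starved NUM_THREADS < verbose_log
    awk -v want="$1" '
        /KMP_AFFINITY: [0-9]+ available OS procs/ {
            # The count is the field just before the word "available".
            for (i = 1; i <= NF; i++)
                if ($i == "available") n = $(i - 1)
            seen++
            verdict = (n + 0 < want + 0) ? "too few" : "ok"
            print "process " seen ": " n " procs (" verdict ")"
        }'
}

# Sample input copied from the PBS run above (OMP_NUM_THREADS=4).
flag_starved 4 <<'EOF'
OMP: Info #156: KMP_AFFINITY: 4 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #156: KMP_AFFINITY: 1 available OS procs
EOF
```

For the sample input this reports the first process as ok and the second as having too few procs, which is the symptom described in this post.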

Siwei_D_

Hi

I do not remember exactly, but it seems it was due to my MPI. My MPICH did not support OpenMP, but then I used MVAPICH and the problem was solved. So try MPICH, MVAPICH, OPENMP to see the difference. Good luck!
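
Two things may be worth checking when switching MPI libraries, sketched below as a dry run that only prints the launch line. First, the job script in the first post defines MPIEXEC but then calls plain mpiexec, so whichever mpiexec comes first in PATH is the one used. Second, MVAPICH2 pins each process itself by default, and its MV2_ENABLE_AFFINITY=0 setting turns that off so the OpenMP runtime can place the threads. Whether either of these was the actual cause here is an assumption, since the thread does not confirm it. The function name launch_line is hypothetical.

```shell
#!/bin/sh
# Show which mpiexec would be picked up from PATH, if any.
command -v mpiexec || echo "mpiexec not found in PATH"

# Dry run: print a hybrid launch line instead of executing it.
# MV2_ENABLE_AFFINITY=0 disables MVAPICH2's own process pinning;
# treating it as the relevant fix for this thread is an assumption.
launch_line() {  # usage: launch_line NPROCS THREADS PROGRAM
    echo "mpiexec -np $1 /bin/env MV2_ENABLE_AFFINITY=0 OMP_NUM_THREADS=$2 KMP_AFFINITY=verbose $3"
}

launch_line 2 8 ./hybpi
```

With KMP_AFFINITY=verbose left on, the "available OS procs" line in the output shows right away whether each rank now sees more than one core.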
