I have a hybrid MPI-OpenMP Fortran code. It's a legacy code with over 300,000 lines, written to do CFD computations.
The OpenMP runs take longer to execute as the number of OMP threads increases, whereas the MPI executions scale almost perfectly up to 32 processors. I use the -openmp -r8 -O3 flags to compile.
Any idea what could be going wrong? Am I missing some optimization flags? How do I check and improve the code's OpenMP performance?
The command I use to run the code is mpirun -np 1 dplace -s1 -x2 ./executable.x (I vary the OMP_NUM_THREADS environment variable). Also, the dplace tool doesn't seem to make any difference to the runs.
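For reference, the scaling behaviour can be captured systematically with a small driver loop. The sketch below is illustrative only (csh-style, matching the setenv usage later in this thread; thread counts and log names are arbitrary):

    foreach n (1 2 4 8 16)
        setenv OMP_NUM_THREADS $n
        ( time mpirun -np 1 dplace -s1 -x2 ./executable.x ) >& scaling_${n}threads.log
    end

Comparing the wall-clock times in the logs against the single-thread run makes the OpenMP (non-)scaling explicit before any tuning is attempted.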
7 Replies
Guessing that you may be running on a simulated shared memory Altix IA64 machine, there is still a reason for your application to be written as a hybrid. You would expect to find optimum performance with a strategy such as a single MPI process per brick, with the number of threads per MPI process depending on the number of cores per brick. The idea of a hybrid is to keep threads which need shared-memory communication local to a true memory bus, while limiting and optimizing the communication which has to go out over the backplane.
This gets beyond the reach of the Fortran forum. If you are using Intel MPI, the HPC forum would be appropriate. As the details depend on which MPI you use, you may have to get in touch with former SGI people, or with someone running the same application on a similar platform, to get advice on another MPI, such as MPT. Are you running a museum?
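To make that strategy concrete: assuming, purely for illustration, a machine of 32 cores arranged as 8 bricks of 4 cores each, one would run 8 MPI processes with 4 OpenMP threads per process, e.g.

    setenv OMP_NUM_THREADS 4
    mpirun -np 8 ./executable.x

How the 8 processes are then pinned to their bricks depends on the placement tool (dplace or omplace with MPT), which is discussed further down this thread.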
Quoting - tim18
Guessing that you may be running on a simulated shared memory Altix IA64 machine, there is still a reason for your application to be written as a hybrid. You would expect to find optimum performance with a strategy such as a single MPI process per brick, with the number of threads per MPI process depending on the number of cores per brick. The idea of a hybrid is to keep threads which need shared-memory communication local to a true memory bus, while limiting and optimizing the communication which has to go out over the backplane.
This gets beyond the reach of the Fortran forum. If you are using Intel MPI, the HPC forum would be appropriate. As the details depend on which MPI you use, you may have to get in touch with former SGI people, or with someone running the same application on a similar platform, to get advice on another MPI, such as MPT. Are you running a museum?
Tim,
You guessed it correctly. I am running my code on an SGI Altix IA64 machine, and the idea you describe behind the hybrid code is also right.
I use MPT, and I am looking for tools that will help me analyze the code and debug anything that is wrong, so that the OpenMP performance is on par with MPI. I'd also like to hear from anyone else who has had such issues in their runs. By the way, the code is actually well designed, with features like cache optimizations etc. :)
Thanks.
You have at most 2 cores sharing cache on IA64, so cache optimizations can extend only to pairs of threads. You have at most 8 cores per motherboard with a fast bus connection, so you can't expect OpenMP performance to match that of a well-implemented MPI application beyond those 8 threads (or whatever number is available on your machine). You would look for suitable groups of cores for each MPI process, communicating by OpenMP within the process. The fact that someone went to the trouble of building a hybrid application indicates that the advantage of such an organization was recognized some time back.
Quoting - Amit
Tim,
You guessed it correctly. I am running my code on an SGI Altix IA64 machine, and the idea you describe behind the hybrid code is also right.
I use MPT, and I am looking for tools that will help me analyze the code and debug anything that is wrong, so that the OpenMP performance is on par with MPI. I'd also like to hear from anyone else who has had such issues in their runs. By the way, the code is actually well designed, with features like cache optimizations etc. :)
Thanks.
Hi,
the first step is to enable some diagnostics.
The very first is to see how the OpenMP threads are being mapped to cores.
A good start is to set the following environment variable:
setenv KMP_AFFINITY verbose,none (or however you do it in the shell you are using)
Then you should see something like this in the output:
OMP: Info #148: KMP_AFFINITY: Affinity capable, using cpuinfo file
OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: {40,41,42,43,44,45,46,47}
OMP: Info #156: KMP_AFFINITY: 8 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #179: KMP_AFFINITY: 4 packages x 2 cores/pkg x 1 threads/core (8 total cores)
OMP: Info #160: KMP_AFFINITY: OS proc to physical thread map ([] => level not in map):
OMP: Info #171: KMP_AFFINITY: OS proc 40 maps to package 5120 core 0 [thread 0 ]
OMP: Info #171: KMP_AFFINITY: OS proc 41 maps to package 5120 core 1 [thread 0 ]
OMP: Info #171: KMP_AFFINITY: OS proc 42 maps to package 5123 core 0 [thread 0 ]
OMP: Info #171: KMP_AFFINITY: OS proc 43 maps to package 5123 core 1 [thread 0 ]
OMP: Info #171: KMP_AFFINITY: OS proc 44 maps to package 5632 core 0 [thread 0 ]
OMP: Info #171: KMP_AFFINITY: OS proc 45 maps to package 5632 core 1 [thread 0 ]
OMP: Info #171: KMP_AFFINITY: OS proc 46 maps to package 5635 core 0 [thread 0 ]
OMP: Info #171: KMP_AFFINITY: OS proc 47 maps to package 5635 core 1 [thread 0 ]
OMP: Info #147: KMP_AFFINITY: Internal thread 0 bound to OS proc set {40}
OMP: Info #147: KMP_AFFINITY: Internal thread 1 bound to OS proc set {42}
OMP: Info #147: KMP_AFFINITY: Internal thread 2 bound to OS proc set {44}
OMP: Info #147: KMP_AFFINITY: Internal thread 3 bound to OS proc set {46}
OMP: Info #147: KMP_AFFINITY: Internal thread 4 bound to OS proc set {41}
OMP: Info #147: KMP_AFFINITY: Internal thread 7 bound to OS proc set {47}
OMP: Info #147: KMP_AFFINITY: Internal thread 5 bound to OS proc set {43}
OMP: Info #147: KMP_AFFINITY: Internal thread 6 bound to OS proc set {45}
This clearly shows a one-to-one mapping between OpenMP threads and physical cores. This is good.
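If instead the verbose output shows threads migrating or several threads bound to the same core, the binding policy can be set explicitly. The two standard Intel OpenMP policies are compact and scatter, for example:

    setenv KMP_AFFINITY verbose,compact
    setenv KMP_AFFINITY verbose,scatter

compact packs the threads onto neighbouring cores (useful when threads share data in cache), while scatter spreads them across packages (useful when each thread needs its own memory bandwidth).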
Some other questions:
what is the CFD application?
are you running the job standalone or via a job scheduler such as PBS?
regards
Mike
Quoting - Mike Rezny
Hi,
the first step is to enable some diagnostics.
The very first is to see how the OpenMP threads are being mapped to cores.
A good start is to set the following environment variable:
setenv KMP_AFFINITY verbose,none (or however you do it in the shell you are using)
Then you should see something like this in the output:
OMP: Info #148: KMP_AFFINITY: Affinity capable, using cpuinfo file
OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: {40,41,42,43,44,45,46,47}
OMP: Info #156: KMP_AFFINITY: 8 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #179: KMP_AFFINITY: 4 packages x 2 cores/pkg x 1 threads/core (8 total cores)
OMP: Info #160: KMP_AFFINITY: OS proc to physical thread map ([] => level not in map):
OMP: Info #171: KMP_AFFINITY: OS proc 40 maps to package 5120 core 0 [thread 0 ]
OMP: Info #171: KMP_AFFINITY: OS proc 41 maps to package 5120 core 1 [thread 0 ]
OMP: Info #171: KMP_AFFINITY: OS proc 42 maps to package 5123 core 0 [thread 0 ]
OMP: Info #171: KMP_AFFINITY: OS proc 43 maps to package 5123 core 1 [thread 0 ]
OMP: Info #171: KMP_AFFINITY: OS proc 44 maps to package 5632 core 0 [thread 0 ]
OMP: Info #171: KMP_AFFINITY: OS proc 45 maps to package 5632 core 1 [thread 0 ]
OMP: Info #171: KMP_AFFINITY: OS proc 46 maps to package 5635 core 0 [thread 0 ]
OMP: Info #171: KMP_AFFINITY: OS proc 47 maps to package 5635 core 1 [thread 0 ]
OMP: Info #147: KMP_AFFINITY: Internal thread 0 bound to OS proc set {40}
OMP: Info #147: KMP_AFFINITY: Internal thread 1 bound to OS proc set {42}
OMP: Info #147: KMP_AFFINITY: Internal thread 2 bound to OS proc set {44}
OMP: Info #147: KMP_AFFINITY: Internal thread 3 bound to OS proc set {46}
OMP: Info #147: KMP_AFFINITY: Internal thread 4 bound to OS proc set {41}
OMP: Info #147: KMP_AFFINITY: Internal thread 7 bound to OS proc set {47}
OMP: Info #147: KMP_AFFINITY: Internal thread 5 bound to OS proc set {43}
OMP: Info #147: KMP_AFFINITY: Internal thread 6 bound to OS proc set {45}
This clearly shows a one-to-one mapping between OpenMP threads and physical cores. This is good.
Some other questions:
what is the CFD application?
are you running the job standalone or via a job scheduler such as PBS?
regards
Mike
The second step is to start looking at how to place hybrid OpenMP/MPI jobs on the Altix using the MPT MPI library.
For that you will need to read the MPT mpi man page, in particular the section on Using MPI with OpenMP, which covers using omplace with mpirun.
-v is also a good flag to use with mpirun.
regards
Mike
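For orientation, a hybrid MPT launch along the lines described in the man page looks roughly like the sketch below; treat it as an assumption and check the omplace man page of the installed MPT version for the exact options (the -nt flag, threads per MPI process, is quoted here from memory):

    setenv OMP_NUM_THREADS 4
    mpirun -v -np 2 omplace -nt 4 ./executable.x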
Quoting - Mike Rezny
Hi,
The second step is to start looking at how to place hybrid OpenMP/MPI jobs on the Altix using the MPT MPI library.
For that you will need to read the MPT mpi man page, in particular the section on Using MPI with OpenMP, which covers using omplace with mpirun.
-v is also a good flag to use with mpirun.
regards
Mike
Mike,
Thanks for the suggestions.
I am trying out various KMP_AFFINITY settings, but so far there has not been much improvement in performance. I also tried omplace instead of the dplace I had been using, and there seems to be no difference at all.
I am currently working on improving the code's performance as an OpenMP process, with multiple OpenMP threads and a single MPI process. (I have restructured the code into one large OpenMP parallel region, hoping to get performance equivalent to MPI, but the scaling is totally absent.)
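For readers unfamiliar with that structure, the fragment below is a minimal sketch of the single-large-parallel-region pattern (hypothetical routine and array names, not taken from the actual code): the parallel region is opened once in a driver, and the work inside is divided with orphaned worksharing constructs.

    subroutine solver_driver(nsteps, n, phi)   ! hypothetical driver routine
       implicit none
       integer, intent(in) :: nsteps, n
       real(8), intent(inout) :: phi(n)
       integer :: step
    !$omp parallel private(step)
       do step = 1, nsteps          ! every thread runs the time loop
          call update(n, phi)       ! the worksharing happens inside
       end do
    !$omp end parallel
    end subroutine solver_driver

    subroutine update(n, phi)
       implicit none
       integer, intent(in) :: n
       real(8), intent(inout) :: phi(n)
       integer :: i
    !$omp do schedule(static)       ! orphaned worksharing construct
       do i = 1, n
          phi(i) = 0.5d0*phi(i)     ! placeholder for the real stencil work
       end do
    !$omp end do                    ! implicit barrier synchronizes the step
    end subroutine update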
I have access to various machines, including standalone machines and machines with PBS queuing.
I am not sure what exactly is happening, but I am looking at using the tools from the University of Houston (OpenUH).
I'll update this post once I get some solution.
Thanks,
Amit
Quoting - Mike Rezny
Hi,
The second step is to start looking at how to place hybrid OpenMP/MPI jobs on the Altix using the MPT MPI library.
For that you will need to read the MPT mpi man page, in particular the section on Using MPI with OpenMP, which covers using omplace with mpirun.
-v is also a good flag to use with mpirun.
regards
Mike
In the past few weeks I have learned that OpenMP performance is all about data locality. OpenMP itself gives no directives to control data placement, so the data does not necessarily stay local to a processor, hence the need for the locality tools on the SGI machines.
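On a ccNUMA machine like the Altix, the usual way to obtain that locality from inside the code is first-touch placement: initialise each array in parallel with the same static decomposition that later uses it, so that the memory pages land on the node of the thread that will work on them. A minimal sketch, with hypothetical array names (not taken from the actual code):

    program first_touch_sketch
       implicit none
       integer, parameter :: n = 10000000
       real(8), allocatable :: a(:), b(:)
       integer :: i
       allocate(a(n), b(n))
    !$omp parallel do schedule(static)
       do i = 1, n                   ! first touch: pages of a and b are
          a(i) = 0.0d0               ! placed near the thread touching them
          b(i) = real(i, 8)
       end do
    !$omp end parallel do
    !$omp parallel do schedule(static)
       do i = 1, n                   ! same static decomposition, so each
          a(i) = a(i) + 2.0d0*b(i)   ! thread works on locally placed pages
       end do
    !$omp end parallel do
       print *, a(n)
    end program first_touch_sketch

If the arrays are instead initialised serially (or by a single thread), all pages land on one node and every other thread pays remote-memory latency, which is one common way an OpenMP code fails to scale while the equivalent MPI decomposition does.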
omplace does not pin OpenMP threads properly when used with the Intel OpenMP library shipped with Intel compiler package versions 10.1.015 and 10.1.017. That library is incompatible with dplace and omplace because it introduces its own CPU affinity functionality without the ability to disable it.
Thus I was able to resolve the issue with the latest Intel OpenMP library, KMP_AFFINITY=disabled, and omplace.
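Putting the pieces together, the working combination described here would look roughly as follows (process and thread counts are placeholders, and the omplace -nt option should be verified against the local MPT man page):

    setenv KMP_AFFINITY disabled
    setenv OMP_NUM_THREADS 8
    mpirun -np 4 omplace -nt 8 ./executable.x

With the Intel runtime's own affinity disabled, omplace is free to do the pinning.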
Thanks,
Amit
