MPI Fortran nested loops

Anders_S_1 · ‎10-26-2019

Hi,

I have parallelized the calculation of a Jacobian matrix using MPI and Intel Fortran. The calculation seems to speed up with up to about 18 threads.

I now want to parallelize the next loop, which lies outside the Jacobian loop, by going to more cores. Can I do this without modifying the code of the Jacobian, or which general changes are necessary? Are there any code examples for such a problem?

Best regards

Anders S

jimdempseyatthecove · ‎10-28-2019

Can you describe your system?

If the 18 threads are on a single system (memory is shared), then your first consideration should be to use OpenMP as opposed to MPI.

MPI is targeted to distributed systems (memory is not shared).

Jim Dempsey

Anders_S_1 · ‎10-28-2019

Dear Jim,

A relevant comment, but I forgot to mention that I cannot use OpenMP because of a third-party subroutine package. Therefore I have to use MPI.

In short, diffusion is calculated in spherical particles. I have two loops: an outer loop over particle diameters and an inner loop calculating the diffusion in a single particle. To solve the diffusion problem, a Jacobian matrix has to be calculated. This calculation has been parallelized using MPI. Up to about 18 threads give a speedup.

DO i=1,ndiam
  CALL CalculateDiffusion(size,rank,diam(i))
ENDDO

If I have 36 threads at my disposal and ndiam=2, how do I modify the MPI commands to parallelize over the particles as well, given that each particle gets 18 threads?

Best regards

Anders S

jimdempseyatthecove · ‎10-28-2019

This would depend upon whether or not the third party package you use is thread safe. You can test this to some extent by running OpenMP on top of MPI and comparing the result data with that produced by the code as is.

!$omp parallel do
DO i=1,ndiam
CALL CalculateDiffusion(size,rank,diam(i))
ENDDO
!$omp end parallel do

Remember to compile with the option that enables OpenMP (for Intel Fortran, /Qopenmp on Windows or -qopenmp on Linux).

*** You may also need to make thread-safe adjustments for code in your control. Not shown is your code in CalculateDiffusion.

In particular, your call above does not pass an array of particles. Therefore I must assume that the array of particles is global (e.g. specified in a module shared amongst procedures). What you have not disclosed is whether diam is an array of arrays (in other words, diam is an array of user defined type, where the type has an array of particles of a given diameter). If (when) this is the case, then (presumably) each diam can be written into independently (thread safe).

Do the different diameter particles interact with each other (in addition to same diameter particles interacting amongst themselves)?

For a different perspective on diffusion programming take a look at:

High Performance Parallelism Pearls by James Reinders, Jim Jeffers
Copyright (c) 2015 Elsevier Inc.

Chapter 5, Plesiochronous Phasing Barriers (by Jim Dempsey)

This article addresses what can be done with a diffusion simulation. Whether this fits your requirements, I cannot say without seeing your code.

FWIW http://www.lotsofcores.com/ has a summary of this article about half way down the web page.

Jim Dempsey

Anders_S_1 · ‎10-28-2019

Dear Jim,

I have checked OpenMP earlier and it does not work, unfortunately. MPI works perfectly.

The particles interact with each other via the external boundary conditions, which are modified iteratively in a loop outside the loop over diameters. Therefore, each call of the CalculateDiffusion subroutine for a single diameter for time step t to t+dtime is independent of all other calls of CalculateDiffusion.

Therefore, my case is extremely simple and straightforward, and I am only asking for the most obvious way to include the diameter loop in the MPI treatment.

Alternatively, consider the very "clean" problem

DO i=1,n
  CALL IntegrateSineFunction(param(i))
ENDDO

where the integration uses m threads and there are n parameter values param(i), i=1,...,n.

How do I parallelize over the parameter loop as well, if I have m*n threads at my disposal?
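One common way to approach this "clean" problem is to split MPI_COMM_WORLD into groups of m ranks and let each group take its own share of the parameter loop. The following is only a minimal sketch under stated assumptions: the three-argument IntegrateSineFunction below is a hypothetical stand-in (the real routine is not shown in this thread), it is assumed that the routine can be given a communicator to use instead of MPI_COMM_WORLD, and the names sine_demo, group_comm, color and ngroups as well as the values m=18 and n=2 are illustrative.

```fortran
module sine_demo
  use mpi
  implicit none
contains
  ! Hypothetical stand-in for IntegrateSineFunction: integrates sin(p*x)
  ! over [0, pi] with the midpoint rule, sharing the steps among the
  ! ranks of the communicator comm.
  subroutine IntegrateSineFunction(comm, p, total)
    integer, intent(in) :: comm
    double precision, intent(in) :: p
    double precision, intent(out) :: total
    integer, parameter :: nsteps = 100000
    double precision, parameter :: pi = 3.141592653589793d0
    integer :: myrank, nranks, k, ierr
    double precision :: h, part

    call MPI_Comm_rank(comm, myrank, ierr)
    call MPI_Comm_size(comm, nranks, ierr)
    h = pi / nsteps
    part = 0.0d0
    do k = myrank + 1, nsteps, nranks
      part = part + sin(p * (k - 0.5d0) * h) * h
    end do
    ! The reduction stays inside the group, not MPI_COMM_WORLD.
    call MPI_Allreduce(part, total, 1, MPI_DOUBLE_PRECISION, MPI_SUM, comm, ierr)
  end subroutine IntegrateSineFunction
end module sine_demo

program split_by_parameter
  use mpi
  use sine_demo
  implicit none
  integer, parameter :: m = 18   ! assumed number of ranks per group
  integer, parameter :: n = 2    ! assumed number of parameter values
  double precision :: param(n), res
  integer :: ierr, wrank, wsize, color, ngroups, group_comm, grank, i

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, wrank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, wsize, ierr)
  param = (/ 1.0d0, 3.0d0 /)

  ! World ranks 0..m-1 get color 0, ranks m..2m-1 get color 1, and so on.
  color = wrank / m
  ngroups = (wsize + m - 1) / m
  call MPI_Comm_split(MPI_COMM_WORLD, color, wrank, group_comm, ierr)
  call MPI_Comm_rank(group_comm, grank, ierr)

  ! Each group handles every ngroups-th parameter, starting at color+1.
  do i = color + 1, n, ngroups
    call IntegrateSineFunction(group_comm, param(i), res)
    if (grank == 0) print *, 'param =', param(i), ' integral =', res
  end do

  call MPI_Comm_free(group_comm, ierr)
  call MPI_Finalize(ierr)
end program split_by_parameter
```

Launched with 36 ranks, this gives two groups of 18 that work on param(1) and param(2) at the same time. With fewer ranks there is a single group that simply walks through all the parameters, so the same program can be checked on a small machine first. Whether a real routine can be handed a communicator this way depends on how it (and any third party package it calls) uses MPI_COMM_WORLD internally.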

Best regards

Anders S

Anders_S_1 · ‎10-31-2019

Hi,

If I use MPI_COMM_SPLIT to group my 36 threads into two groups, diam1 and diam2, each with 18 threads, will this information be available in all subroutines with INCLUDE 'mpif.h' when I replace MPI_COMM_WORLD with diam1 or diam2, e.g. in calls of MPI_REDUCE?
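For what it is worth, with INCLUDE 'mpif.h' or USE mpi the handle returned by MPI_COMM_SPLIT is an ordinary INTEGER variable in the calling program. The include file only defines predefined constants such as MPI_COMM_WORLD, so a new communicator has to be passed as an argument or kept in a module (or COMMON block) to be visible elsewhere. A minimal sketch of the module approach follows; the names diam_group, diam_comm, setup_diam_group and group_sum are hypothetical, and 18 ranks per diameter is taken from the question.

```fortran
module diam_group
  use mpi
  implicit none
  ! Hypothetical module variable holding this rank's group communicator.
  integer, save :: diam_comm
contains
  ! Split MPI_COMM_WORLD into consecutive blocks of ranks_per_diam ranks
  ! and remember the resulting communicator for later use.
  subroutine setup_diam_group(ranks_per_diam)
    integer, intent(in) :: ranks_per_diam
    integer :: wrank, ierr
    call MPI_Comm_rank(MPI_COMM_WORLD, wrank, ierr)
    call MPI_Comm_split(MPI_COMM_WORLD, wrank / ranks_per_diam, wrank, &
                        diam_comm, ierr)
  end subroutine setup_diam_group

  ! Example of a routine that reduces over the group only; the result
  ! is defined on group rank 0, as with any MPI_Reduce.
  subroutine group_sum(local, total)
    double precision, intent(in) :: local
    double precision, intent(out) :: total
    integer :: ierr
    total = 0.0d0
    call MPI_Reduce(local, total, 1, MPI_DOUBLE_PRECISION, MPI_SUM, 0, &
                    diam_comm, ierr)
  end subroutine group_sum
end module diam_group

program demo_group_reduce
  use mpi
  use diam_group
  implicit none
  integer :: ierr, grank
  double precision :: total

  call MPI_Init(ierr)
  call setup_diam_group(18)
  call MPI_Comm_rank(diam_comm, grank, ierr)

  ! Every rank contributes 1, so the root of each group receives the
  ! number of ranks in its own group.
  call group_sum(1.0d0, total)
  if (grank == 0) print *, 'ranks in this group:', total

  call MPI_Comm_free(diam_comm, ierr)
  call MPI_Finalize(ierr)
end program demo_group_reduce
```

Because every rank belongs to exactly one group, a single variable such as diam_comm is enough on each rank; separate diam1 and diam2 variables are not needed. Rank numbers inside diam_comm run from 0 again, so any rank and size values passed down to the existing routines would have to come from MPI_COMM_RANK and MPI_COMM_SIZE on the new communicator.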

Best regards

Anders S

jimdempseyatthecove · ‎11-01-2019

1>> each call of the CalculateDiffusion subroutine for a single diameter for time step t to t+dtime is independent of all other calls of CalculateDiffusion

ergo

CALL CalculateDiffusion(size,rank,diam(1)); CALL CalculateDiffusion(size,rank,diam(2));... CALL CalculateDiffusion(size,rank,diam(ndiam))

can occur in parallel (within rank of size#s of ranks)

2>>I have checked OpenMP earlier and it does not work

ergo 1>> is not correct .OR. 2>> is not correct

You have provided too little information as to how best to increase the thread count.

When desiring to use a combination of MPI and OpenMP you should refer to the documentation:

https://software.intel.com/en-us/mpi-developer-guide-windows-running-an-mpi-openmp-program

You need to set both of these environment variables:

I_MPI_PIN_DOMAIN=omp
OMP_NUM_THREADS=n

where n is the number of OpenMP threads per rank (and thus the number of logical processors each rank occupies on its node).

Note, you should also select KMP_AFFINITY=compact or scatter or use the OMP_... related environment variables.

Also, be mindful that OMP_NUM_THREADS * number of processes on a node should not exceed the number of logical processors on the node.

I (we) would have to see more detail than what you posted in #5.

For example, without seeing IntegrateSineFunction, it is not known whether multiple param(i)'s can be placed into a SIMD vector and then processed as vectors, .OR. whether IntegrateSineFunction would benefit from being executed by multiple threads (within a rank), .OR. some combination of both.

Jim Dempsey

jimdempseyatthecove · ‎11-01-2019

From your description in post #6 it sounds like you have a single workstation/server containing either:

One 18 core CPU with HT enabled
,OR,
Two 18 core CPUs with HT disabled

(I am not aware of any 9 core CPUs or 3-socket motherboards)

With this configuration, if it is at all possible to run the entire application within OpenMP then this would be the better route.

The MPI (distributed) method is typically used on a single SMP system only when the application, or a 3rd party library used by the application, uses statically assigned workspaces (or when it is incapable of performing a reduction to a shared object).

Jim Dempsey