Extensive memory usage error for large matrices in MPI/OpenMP hybrid Fortran code

Amruth_A_ · ‎10-16-2017

Hello everyone, I'm working on some astrophysical simulations and I've written a Fortran code. I have implemented MPI and OpenMP to create a hybrid code. The code works fine, with optimisations -O3, for small-medium array sizes. However, when I try to increase the size of the matrices to obtain higher resolution, I run out of memory on the nodes in the HPC cluster, to be exact 96 GB on each node. I've tried running on 4 nodes with 20 cores each. Is there any way I can run my code while using larger matrix sizes without having such a large memory requirement? I've read about using allocatable arrays and I've tried, but the memory reduction is only by less than 1 %, perhaps I'm allocating and deallocating them wrong. Basically, I read four external input arrays (X,Y,zeta_list,M_list) and I only have one major calculation that's happening in my code within a do loop, it loops over coordinates(in the array zeta_list), so I use MPI_scatter to break these up into chunks. I've shown an example code below so it might be easier to understand what's going on. Would appreciate any help on this!!!

<span style="font-family: Arial, 宋体, Tahoma, Helvetica, sans-serif; font-size: 1em;"> </span>
PROGRAM TEST_MPI

use mpi

IMPLICIT NONE 

DOUBLE PRECISION,DIMENSION(1,1000000)::M_list
DOUBLE PRECISION,DIMENSION(20000,20000)::X,Y,z_m_z_x,z_m_z_y,dist_z_m_z,alpha_x, alpha_y, sub_alpha_x, sub_alpha_y
DOUBLE PRECISION,DIMENSION(2,1000000)::zeta_list
DOUBLE PRECISION,ALLOCATABLE,DIMENSION(:,:)::sub_zeta_list
INTEGER::i,myid,ntasks,ierror,n_elements,chunk,t1,t2,clock_rate,clock_max,required,provided


call MPI_INIT_THREAD(MPI_THREAD_SERIALIZED,provided,ierror)
call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierror)
call MPI_COMM_SIZE(MPI_COMM_WORLD, ntasks, ierror)

if (myid==0) then
	call system_clock ( t1, clock_rate, clock_max )
end if 

open(10,file='/home/h1352888/Fortran_Test/10000points_5000grid/M_list.bin',form='UNFORMATTED',access='STREAM')
open(9,file='/home/h1352888/Fortran_Test/10000points_5000grid/zeta_list.bin',form='UNFORMATTED',access='STREAM')
open(8,file='/home/h1352888/Fortran_Test/10000points_5000grid/X.bin',form='UNFORMATTED',access='STREAM')
open(7,file='/home/h1352888/Fortran_Test/10000points_5000grid/Y.bin',form='UNFORMATTED',access='STREAM')

read(10)M_list
read(9)zeta_list
read(8)X
read(7)Y


alpha_x = 0
alpha_y = 0 
sub_alpha_x = 0 
sub_alpha_y = 0 
z_m_z_x = 0
z_m_z_y = 0 
dist_z_m_z = 0 
n_elements = size(zeta_list,1)*size(zeta_list,2)
chunk = n_elements/ntasks
allocate(sub_zeta_list(2,chunk/2))
sub_zeta_list = 0 

!The final output arrays are alpha_x and alpha_y which are obtained from their sub_ counterparts via the loop.

call MPI_Scatter(zeta_list,chunk,MPI_DOUBLE_PRECISION,sub_zeta_list,chunk,MPI_DOUBLE_PRECISION,0,MPI_COMM_WORLD,ierror)
!$OMP PARALLEL DO SCHEDULE(GUIDED,50) private(z_m_z_x, z_m_z_y, dist_z_m_z) reduction( +: sub_alpha_x, sub_alpha_y )
do i=1,chunk/2,1	
    	z_m_z_x = X - sub_zeta_list(1,i)
	z_m_z_y = Y - sub_zeta_list(2,i)
	dist_z_m_z = z_m_z_x**2 + z_m_z_y**2
	sub_alpha_x = sub_alpha_x + (M_list(1,i)* z_m_z_x / dist_z_m_z)
        sub_alpha_y = sub_alpha_y + (M_list(1,i)* z_m_z_y / dist_z_m_z)
end do
!$OMP END PARALLEL DO 

call MPI_Reduce(sub_alpha_x,alpha_x,400000000,MPI_DOUBLE_PRECISION,MPI_SUM,0,MPI_COMM_WORLD,ierror)    
call MPI_Reduce(sub_alpha_y,alpha_y,400000000,MPI_DOUBLE_PRECISION,MPI_SUM,0,MPI_COMM_WORLD,ierror)    

if (myid == 0) then
	write(*,*)'Number of processes is', ntasks
	write(*,*)chunk
	!write(*,*)alpha_x
	call system_clock ( t2, clock_rate, clock_max )
	write ( *, * ) 'Elapsed real time = ', real ( t2 - t1 ) / real ( clock_rate ) ,'seconds'
	open(12,file='/home/h1352888/Fortran_Test/10000points_5000grid/alpha_x.bin',form='UNFORMATTED',access='STREAM')
	write(12)alpha_x
	open(11,file='/home/h1352888/Fortran_Test/10000points_5000grid/alpha_y.bin',form='UNFORMATTED',access='STREAM')
	write(11)alpha_y
end if 

call MPI_FINALIZE(ierror)
stop
END PROGRAM TEST_MPI

I compile with mpiifort -g -traceback -qopenmp -O3 -ipo -xHost -mt_mpi test.f90, I also know about the -heap-arrays flag(which slows down the code significantly, 2x-3x) which I use sometimes, or I simply set OMP_STACKSIZE, although I know this isn't good for larger matrices. For example, if I run on 4 nodes and 20 cores, I set 4 MPI processes and let OpenMP have 20 threads.

Thanks a lot

Laura_S_3 · ‎10-16-2017

Looking at your code, it should not be blowing out of 96 GB, especially if there is only one MPI process per node. However, assuming N=5,000 is the number of unknowns, it will run out of memory somewhere around N=35,000 (assuming 1 MPI process per node). At what point did you start to see problems? MPI does have some overhead, but most MPI implementations aren't too wasteful.

Amruth_A_ · ‎10-16-2017

The values I put in this code are example values, let me re-edit to put in the actual matrix sizes. Out of the 4 input arrays, zeta_list and M_list are very small, the ones taking the large memory are the ones that in the above example are 5000 x 5000 arrays. Now, in the code that is taking up almost 96 GB, the sizes for zeta_list is (1000000,2), M_list is (1000000,1), while X and Y are 20,000 x 20,000. Apart from these initial input arrays, there are arrays that I have to initialize with equal dimensions ( 20,000 x 20,000 ) in the midst of the computations, mainly in the DO loop. I suppose these 20,000 x 20,000 arrays are the culprits I have to deal with. These are the values at which the code simply slowed down by a tonne, I presume due to using the swap memory, and stopped using most of the cores. These are still with 4 nodes ( 4 MPI processes ) and 20 OpenMP threads for each node.

Laura_S_3 · ‎10-18-2017

You have 9 20,000x20,000 double precision arrays. Each double precision value takes 8 bytes. Those arrays alone will take 28.8 GB of memory per MPI process. Like you said, the other arrays aren't consuming much, relatively speaking.

28.8 is still much less than 96 GB. Is it possible that you have more than one MPI process per node? On some HPCs, it is slightly tricky to get the process distributed where you want them in a hybrid MPI/OpenMP program. The command MPI_Get_processor_name can be used to see which physical nodes each process is on.

To go to larger problems sizes, you will need to rewrite your program. This isn't a Fortran problem, it is a memory usage problem. With a dense matrix problem, memory is going to scale with N^2. So, when you double the number of unknowns, you quadruple the amount of memory. The way the program is written, every process needs to have enough memory to store the whole array. You are going to have to change the way the program works so that each process stores only part of the array. ScaLAPACK has routines that do the sort of things you need.See ScaLAPACK block cyclic distribution. It looks hard at first, but once you understand it, it is straight-forward. For very large problems, Netlib has an out-of-core version of ScaLAPACK, but with model computers and dense matrices, you are probably going to hit time limitation before OOC ScaLAPACK is needed.

Also, skimming your program, I don't see where the different MPI processes are assigned to do different work. OMP won't work across different nodes or different MPI processes (unless some big improvements have been made recently. :-)) So, (I could be wrong) it looks to me like each node is doing the same work.

jimdempseyatthecove · ‎10-18-2017

Make the arrays allocatable. Leave the private arrays unallocated prior to the parallel region. Use FIRSTPRIVATE on the unallocated arrays, allocate inside of the parallel region (deallocate when you are about to exit the parallel region).

The compiler may also be generating temporary copies when you use implicit array operations. When this happens, the temporary copy can be eliminated with the use of explicit DO loops.

Jim Dempsey

Amruth_A_ · ‎10-19-2017

Laura, Yeah I did the calculation and I too don't know why 96 GB is being used when I check...Well, as to the more than one MPI process per node, I've checked, because I run the program with

export OMP_STACKSIZE=15g
export OMP_NUM_THREADS=${NTHREADS} 
export I_MPI_PIN_DOMAIN=omp
mpirun -ppn 1 -machinefile ${MACHFILE} -np 4 ${PBS_O_WORKDIR}/a.out >> ${OUTFILE}

So each node should be getting only 1 process...About rewriting the program, thanks for the suggestions will definitely check out ScaLAPACK. And finally about the MPI processes doing different work, if I understand correctly, the MPI_Scatter distributes chunks of the array zeta_list to each process, and then the OMP gets to work on the loop for each of these chunks. So isn't each MPI process only doing work on the chunk of the array rather than the whole array? So each node would contain different pieces of work right? Either way, I still can't figure out why so much memory is being used.....

Jim, Thanks for the suggestion, I did try out using allocatable arrays, but this turns out to even use more memory, if the way I've written the code is correct. I excluded the MPI and just used OpenMP to simplify the situation,

PROGRAM TEST_MPI

IMPLICIT NONE 
DOUBLE PRECISION,ALLOCATABLE,DIMENSION(:,:)::X,Y,z_m_z_x,z_m_z_y,alpha_x, alpha_y,zeta_list,M_list
INTEGER::i,t1,t2,clock_rate,clock_max
call system_clock ( t1, clock_rate, clock_max )

open(10,file='/data/h1352888/Fortran_Test/1mpoints_10kgrid/M_list.bin',form='UNFORMATTED',access='STREAM')
open(9,file='/data/h1352888/Fortran_Test/1mpoints_10kgrid/zeta_list.bin',form='UNFORMATTED',access='STREAM')
open(8,file='/data/h1352888/Fortran_Test/1mpoints_10kgrid/X.bin',form='UNFORMATTED',access='STREAM')
open(7,file='/data/h1352888/Fortran_Test/1mpoints_10kgrid/Y.bin',form='UNFORMATTED',access='STREAM')

allocate(X(10000,10000))
allocate(Y(10000,10000))
allocate(zeta_list(2,1000000))
allocate(M_list(1,1000000))
read(10)M_list
read(9)zeta_list
read(8)X
read(7)Y

allocate(alpha_x(10000,10000))
alpha_x = 0
allocate(alpha_y(10000,10000))
alpha_y = 0 

!$OMP PARALLEL  firstprivate(z_m_z_x, z_m_z_y)
allocate(z_m_z_x(10000,10000))
allocate(z_m_z_y(10000,10000))
z_m_z_x = 0
z_m_z_y = 0
!$OMP DO SCHEDULE(GUIDED,100) reduction( +: alpha_x,alpha_y)
do i=1,size(zeta_list,2),1
    	z_m_z_x = X - zeta_list(1,i)
	z_m_z_y = Y - zeta_list(2,i)
	alpha_x = alpha_x + (M_list(1,i)* z_m_z_x / (z_m_z_x**2 + z_m_z_y**2))
        alpha_y = alpha_y + (M_list(1,i)* z_m_z_y /(z_m_z_x**2 + z_m_z_y**2))

end do
!$OMP END DO
deallocate(z_m_z_x)
deallocate(z_m_z_y)
!$OMP END PARALLEL 

deallocate(X)
deallocate(Y)
deallocate(zeta_list)
deallocate(M_list)

open(17,file='/data/h1352888/Fortran_Test/1mpoints_10kgrid/alpha_x.bin',form='UNFORMATTED',access='STREAM')
write(17)alpha_x
open(16,file='/data/h1352888/Fortran_Test/1mpoints_10kgrid/alpha_y.bin',form='UNFORMATTED',access='STREAM')
write(16)alpha_y

deallocate(alpha_x)
deallocate(alpha_y)
call system_clock ( t2, clock_rate, clock_max )
write ( *, * ) 'Elapsed real time = ', real ( t2 - t1 ) / real ( clock_rate ) ,'seconds'
stop
END PROGRAM TEST_MPI

Apart from this, by implicit array operations, do you mean the calculations in the loop? I don't see any other place with array operations, apart from writing the alpha_x into output files.

UPDATE: I was trying out many combinations, and one particular case where I removed the reduction ( +: alpha_x, alpha_y) caused the memory usage to significantly drop down to the amount of memory that we expect to be used. As in, if it was 20,000 x 20,000 arrays, the memory usage is simply number of arrays * 20,000 * 20,000 * 8. I'm not sure why the reduction clause takes so much memory? Is there a way to circumvent this or would it be an issue if I simply omitted the reduction clause?

jimdempseyatthecove · ‎10-19-2017

do i=1,size(zeta_list,2),1
     z_m_z_x = X - zeta_list(1,i) ! z_m_z_x(1:10000, 1:10000) = scalar
!       array   = scalar - scalar
 z_m_z_y = Y - zeta_list(2,i) ! z_m_z_y(1:10000, 1:10000) = scalar
!       array   = scalar - scalar
 alpha_x = alpha_x + (M_list(1,i)* z_m_z_x / (z_m_z_x**2 + z_m_z_y**2))
!       array   = array + (scalar * array / (array**2 + array**2)) 
!       array   = array + (arrayTemp1 / (arrayTemp2 + arrayTemp3))
!       array   = array + (arrayTemp1 / arrayTemp4)
!       array   = array + arrayTemp5
!       array   = arrayTemp6
        alpha_y = alpha_y + (M_list(1,i)* z_m_z_y /(z_m_z_x**2 + z_m_z_y**2))
!       array   = array + (scalar * array / (array**2 + array**2)) 
!       array   = array + (arrayTemp1 / (arrayTemp2 + arrayTemp3))
!       array   = array + (arrayTemp1 / arrayTemp4)
!       array   = array + arrayTemp5
!       array   = arrayTemp6
end do

You may have at least 6, or possibly 12 array temporaries (usually on stack), each (10000,10000)*8 bytes

Jim Dempsey

Amruth_A_ · ‎10-19-2017

I see...Most of this extra memory usage is due to the private variables, as well as the reduction clause variables, could I explicitly write out the code in some way such that I don't have to make use of these clauses? Would that result in reduced memory usage?

jimdempseyatthecove · ‎10-19-2017

Your above code, I presume is sketch code, not what you are actually doing. If I presume wrong, and the above code was actual code...
... then why are z_m_z_x and z_m_z_y 2D arrays, all cells containing the same value??? It looks like the original code should work with scalars for z_m_z_x and z_m_z_y. Then the two arrays and 6 tempArrays (800,000,000 bytes each) would be two scalars and 6 temps (8 bytes each, probably all in registers).

Jim Dempsey

Laura_S_3 · ‎10-19-2017

Given this isn't the full program, I didn't try to figure out your algorithm, but is it possible you can read in a subset of your data, scatter, do computation, reduce, then output your data again? I.e. an out-of-core application.

Amruth_A_ · ‎10-19-2017

Jim, It is the actual code, z_m_z_x and z_m_z_y do not contain the same value, they are different because X is a 2D meshgrid, so each value in X is different. X can range, for example, from -50 to 50 in intervals based on the grid size, and then each of these coordinate points is subtracted from the x, and y coordinates in zeta_list respectively.

Laura, I do think that is my final end goal, to separate the data of each input array into subsets, I am doing that for the array zeta_list, but not for X and Y, because they are 2d meshgrid arrays and its a bit harder for me to visualise splitting them into chunks. Actually, my first post contains the full code, if you want to test it out, instead of inputting the external arrays, you can simply replace the input arrays with random numbers using call random_number(X) for example.

jimdempseyatthecove · ‎10-20-2017

integer :: j,k
...
!$OMP DO SCHEDULE(GUIDED,100) reduction( +: alpha_x,alpha_y)
do i=1,size(zeta_list,2),1
  do j=1,ubound(X,2)
    do k=1,ubound(X,1)
      z_m_z_x(k,j) = X(k,j) - zeta_list(1,i)
      z_m_z_y(k,j) = Y(k,j) - zeta_list(2,i)
      alpha_x(k,j) = alpha_x(k,j) + (M_list(1,i)* z_m_z_x(k,j) / (z_m_z_x(k,j)**2 + z_m_z_y(k,j)**2))
      alpha_y(k,j) = alpha_y(k,j) + (M_list(1,i)* z_m_z_y(k,j) /(z_m_z_x(k,j)**2 + z_m_z_y(k,j)**2))
    end do
  end do
end do
!$OMP END DO

The above will get rid of the array temporaries (excepting possibly the reduction).

Jim Dempsey

jimdempseyatthecove · ‎10-20-2017

Additional note,

The "unnecessary" array temporaries on implicit DO array assignments has been an issue for a long time with IFV. And it has been aggravated when the auto-reallocation of left hand side is used (which is default now). If you do not like explicit do loops, try disabling the realloc_lhs for this source.

Jim Dempsey

jimdempseyatthecove · ‎10-20-2017

Without explicit DO, possibly

20 cores (no HT)
20 * 2 * arrayOfDouble(10000,10000) = 32,000,000,000 for firstprivate(z_m_z_x, z_m_z_y)
20 * 6 * arrayOfDouble(10000,10000) = 96,000,000,000 for 6 temp arrays

128GB without HT
with HT 256 GB

Plus the other shared arrays (once)

Jim Dempsey

Amruth_A_ · ‎10-26-2017

I see, will try this out. Although it appears to scale up the simulation, I would need to consider other methods such as slicing all my input arrays into chunks. Thanks for the help!

jimdempseyatthecove · ‎10-26-2017

>> I would need to consider other methods such as slicing all my input arrays into chunks

Doesn't the explicit DO do just that?

Additional note re: slicing all my input arrays into chunks

After you decide on your slice points, you might find it handy to perform your allocations using the slice point positions

sliceSize = arraySize / rank_size
if(mod(sliceSize,8) .ne. 0) sliceSize = sliceSize + (8-mod(sliceSize,8)) ! cach line for double
theLbound = 1 + sliceSize * rank
theUbound = min(theLbound + sliceSize - 1), arraySize)
allocate(array(theLbound, theUbound))
...

or you can have each rank use 1:sliceSize - your preference

Jim Dempsey

Amruth_A_ · ‎11-01-2017

The explicit DO only splits the zeta_list, if I could split the X and Y's as well, I'm hoping to reduce the memory usage significantly...and will try your method, thanks!

jimdempseyatthecove · ‎11-03-2017

>>The explicit DO only splits the zeta_list, if I could split the X and Y's as well

OpenMP has a collapse clause.

integer :: j,k
...
! TBD: you determine if it is better to use COLLAPSE(2) or COLLAPSE(3)
!$OMP DO COLLAPSE(2) SCHEDULE(GUIDED,100) reduction( +: alpha_x,alpha_y)
do i=1,size(zeta_list,2),1
  do j=1,ubound(X,2)
    do k=1,ubound(X,1)
      z_m_z_x(k,j) = X(k,j) - zeta_list(1,i)
      z_m_z_y(k,j) = Y(k,j) - zeta_list(2,i)
      alpha_x(k,j) = alpha_x(k,j) + (M_list(1,i)* z_m_z_x(k,j) / (z_m_z_x(k,j)**2 + z_m_z_y(k,j)**2))
      alpha_y(k,j) = alpha_y(k,j) + (M_list(1,i)* z_m_z_y(k,j) /(z_m_z_x(k,j)**2 + z_m_z_y(k,j)**2))
    end do
  end do
end do
!$OMP END DO

Jim Dempsey

jimdempseyatthecove · ‎11-03-2017

Note, when the outer loop iteration count is several times the number of threads, then it may be best not to collapse the loops. You will have to experiment as to if GUIDED or DYNAMIC is better, however, when the work load is equal per iteration then STATIC scheduling might be better when the system is not busy running other processes.

Jim Dempsey

Amruth_A_ · ‎11-05-2017

Jim, thanks a lot for the continued advice, really appreciate it. So I did try the collapse(2) clause along with how you redefined the loop and it halves the memory usage as expected since z_m_z_x and z_m_z_y are not private(in the case of 2e3 x 2e3 arrays). However collapse(3) gives me nonsensical results, and the code completes running in < 1 second. I presume this is because of the inner loop being included? If it does help, I compiled with

ifort -O3 -qopenmp -g -xHost -ipo test_hybrid.f90

However, when trying the same method for getting rid of the array temporaries in a larger calculation which has 2e4 x 2e4 arrays(and outer loop iteration from 0 to 1 million), the 2nd method actually takes a few extra GB of memory than using the private variables method. Any idea why this might be happening?

Original method

!$OMP PARALLEL DO SCHEDULE(GUIDED,1000) private(z_m_z_x, z_m_z_y, dist_z_m_z) reduction( +: alpha_x, alpha_y)
do i=1,size(zeta_list,2),1	
    	z_m_z_x = X - zeta_list(1,i)
        z_m_z_y = Y - zeta_list(2,i)
        dist_z_m_z = z_m_z_x**2 + z_m_z_y**2
        alpha_x = alpha_x + (M_list(1,i)* z_m_z_x / dist_z_m_z)
        alpha_y = alpha_y + (M_list(1,i)* z_m_z_y / dist_z_m_z)
end do
!$OMP END PARALLEL DO

2nd method 

!$OMP PARALLEL DO COLLAPSE(1) SCHEDULE(GUIDED,1000) reduction( +: alpha_x,alpha_y)
do i=1,size(zeta_list,2),1
  do j=1,ubound(X,2)
    do k=1,ubound(X,1)
      z_m_z_x(k,j) = X(k,j) - zeta_list(1,i)
      z_m_z_y(k,j) = Y(k,j) - zeta_list(2,i)
      alpha_x(k,j) = alpha_x(k,j) + (M_list(1,i)* z_m_z_x(k,j) / (z_m_z_x(k,j)**2 + z_m_z_y(k,j)**2))
      alpha_y(k,j) = alpha_y(k,j) + (M_list(1,i)* z_m_z_y(k,j) /(z_m_z_x(k,j)**2 + z_m_z_y(k,j)**2))
    end do
  end do
end do
!$OMP END PARALLEL DO

On another note, although at first sight it appears that each iteration has equal amounts of workload, I found out there is a large load imbalance for long computations, I'm not sure what the source is, which is why I started to use GUIDED. I decided against DYNAMIC since I read that it has more overhead than GUIDED.

Secondly, regarding slicing the 2 additional input arrays (X and Y) ,ignore the M_list, I can do without it. The way I visualise it, say X is a meshgrid ranging from -100 to 100. What's going on right now is that z_m_z_x = X - zeta_list[1] calculates for the entire range -100 to 100. So one way to reduce the memory usage at unit time is to split X into 2 parts, -100 to 0 and 0 to 100. So then the loops all calculate the alpha_x's corresponding to -100 to 0, and then the alpha_x's are calculated for the 2nd part of X, from 0 to 100. So if I understand your slicing code correctly, is this what its doing?

Just for comparison, this is the current updated hybrid code with example array sizes, after using your method to deal with the array temporaries,

PROGRAM TEST_MPI

use mpi

IMPLICIT NONE 

DOUBLE PRECISION,DIMENSION(1,10000)::M_list
DOUBLE PRECISION,DIMENSION(2000,2000)::X,Y,z_m_z_x,z_m_z_y,alpha_x, alpha_y, sub_alpha_x, sub_alpha_y
DOUBLE PRECISION,DIMENSION(2,10000)::zeta_list
DOUBLE PRECISION,ALLOCATABLE,DIMENSION(:,:)::sub_zeta_list
INTEGER::i,j,k,myid,ntasks,ierror,n_elements,chunk,t1,t2,clock_rate,clock_max,required,provided

call MPI_INIT_THREAD(MPI_THREAD_SERIALIZED,provided,ierror)
call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierror)
call MPI_COMM_SIZE(MPI_COMM_WORLD, ntasks, ierror)

if (myid==0) then
	call system_clock ( t1, clock_rate, clock_max )
end if 

! I load the input arrays, which are the X and Y grids, the coordinates of
! the mass points, as well as the value of the mass 

open(10,file='/home/h1352888/Fortran_Test/10000points_2000grid/M_list.bin',form='UNFORMATTED',access='STREAM')
open(9,file='/home/h1352888/Fortran_Test/10000points_2000grid/zeta_list.bin',form='UNFORMATTED',access='STREAM')
open(8,file='/home/h1352888/Fortran_Test/10000points_2000grid/X.bin',form='UNFORMATTED',access='STREAM')
open(7,file='/home/h1352888/Fortran_Test/10000points_2000grid/Y.bin',form='UNFORMATTED',access='STREAM')

read(10)M_list
read(9)zeta_list
read(8)X
read(7)Y


alpha_x = 0
alpha_y = 0 
sub_alpha_x = 0 
sub_alpha_y = 0 
z_m_z_x = 0
z_m_z_y = 0 
n_elements = size(zeta_list,1)*size(zeta_list,2) ! Chunk size defined for parallelisation
chunk = n_elements/ntasks
allocate(sub_zeta_list(2,chunk/2))	
sub_zeta_list = 0 

!This scatters the list of coordinates to each process 
!The DO loop runs over the total number of points, calculating alpha for each point at each grid point

call MPI_Scatter(zeta_list,chunk,MPI_DOUBLE_PRECISION,sub_zeta_list,chunk,MPI_DOUBLE_PRECISION,0,MPI_COMM_WORLD,ierror)

!$OMP PARALLEL DO COLLAPSE(2) SCHEDULE(GUIDED,100) reduction( +: sub_alpha_x, sub_alpha_y)
do i=1,chunk/2,1
  do j=1,ubound(X,2)
    do k=1,ubound(X,1)
      z_m_z_x(k,j) = X(k,j) - sub_zeta_list(1,i)
      z_m_z_y(k,j) = Y(k,j) - sub_zeta_list(2,i)
      sub_alpha_x(k,j) = sub_alpha_x(k,j) + (M_list(1,i)* z_m_z_x(k,j) / (z_m_z_x(k,j)**2 + z_m_z_y(k,j)**2))
      sub_alpha_y(k,j) = sub_alpha_y(k,j) + (M_list(1,i)* z_m_z_y(k,j) /(z_m_z_x(k,j)**2 + z_m_z_y(k,j)**2))
    end do
  end do
end do
!$OMP END PARALLEL DO


call MPI_Reduce(sub_alpha_x,alpha_x,4000000,MPI_DOUBLE_PRECISION,MPI_SUM,0,MPI_COMM_WORLD,ierror)    
call MPI_Reduce(sub_alpha_y,alpha_y,4000000,MPI_DOUBLE_PRECISION,MPI_SUM,0,MPI_COMM_WORLD,ierror)    

!Collect the calculated data

if (myid == 0) then
	write(*,*)'Number of processes is', ntasks
	write(*,*)chunk
	call system_clock ( t2, clock_rate, clock_max )
	write ( *, * ) 'Elapsed real time = ', real ( t2 - t1 ) / real ( clock_rate ) ,'seconds'
	open(12,file='/home/h1352888/Fortran_Test/10000points_2000grid/alpha_x.bin',form='UNFORMATTED',access='STREAM')
	write(12)alpha_x
	open(11,file='/home/h1352888/Fortran_Test/10000points_2000grid/alpha_y.bin',form='UNFORMATTED',access='STREAM')
	write(11)alpha_y
end if 

call MPI_FINALIZE(ierror)
stop
END PROGRAM TEST_MPI

jimdempseyatthecove · ‎11-05-2017

>>However collapse(3) gives me nonsensical results, and the code completes running in < 1 second. I presume this is because of the inner loop being included?

This is likely due on, collapse(3), to a combination of the reduction operation occurring at the end of a thread slice .AND. the array split points of sub_alpha_x and sub_alpha_y not being at the array end points. On collapse(2), the thread iteration split points are at intervals of multiples of those arrays. IOW the reduction operation for each thread reduces the entire array for that thread as opposed to the specific slice of the array used by each thread.

>>memory increase?

One possible reason (may be more) is with collapse(2) .AND. chunk size of 100, that the number of partitions is less than the number of threads. Whereas collapse(3) uses more threads, and thus creates more private copies of the reduction arrays.

>>read(10)M_list; read(9)zeta_list; read(8)X; read(7)Y

Each rank is reading the same data as opposed to reading part of the data. It is a program design consideration as to if each rank reads all of the data as opposed to part of the data. When all data is read, then the MPI_Scatter might be avoidable (data is already located in each rank). For small number of ranks the latency may be less for each rank to read unnecessary data and omit the MPI_Scatter, but for larger number of ranks it may be more efficient for each rank to read a slice of the input data.

If you intend to use MPI_Scatter, then have only one rank (e.g. 0) perform the opens and reads, as well as opens and writes

>>chunk = n_elements / ntasks ! n/#ranks

When n_elements is not a multiple of number of ranks, then the division will have a remainder, and thus the tail-end remainder of the work will not be performed.

>>allocate(sub_zeta_list(2,chunk/2))

When chunk is not a multiple of 2, the above allocation is incorrect.

Jim Dempsey