Intel® Fortran Compiler

Excessive memory usage for large matrices in MPI/OpenMP hybrid Fortran code

Amruth_A_
Beginner
2,328 Views

Hello everyone, I'm working on some astrophysical simulations and I've written a Fortran code that uses both MPI and OpenMP as a hybrid. The code works fine, with -O3 optimisation, for small to medium array sizes. However, when I try to increase the size of the matrices to obtain higher resolution, I run out of memory on the HPC cluster nodes, which have 96 GB each; I've tried running on 4 nodes with 20 cores each. Is there any way I can run my code with larger matrix sizes without such a large memory requirement?

I've read about using allocatable arrays and have tried them (a minimal sketch of the idea is included after the example code below), but the memory reduction was less than 1%, so perhaps I'm allocating and deallocating them wrongly. Basically, I read four external input arrays (X, Y, zeta_list, M_list) and there is only one major calculation in the code, inside a do loop that runs over the coordinates in zeta_list, so I use MPI_Scatter to break these up into chunks. I've shown an example code below so it might be easier to understand what's going on. Would appreciate any help on this!

PROGRAM TEST_MPI

use mpi

IMPLICIT NONE 

DOUBLE PRECISION,DIMENSION(1,1000000)::M_list
DOUBLE PRECISION,DIMENSION(20000,20000)::X,Y,z_m_z_x,z_m_z_y,dist_z_m_z,alpha_x, alpha_y, sub_alpha_x, sub_alpha_y
DOUBLE PRECISION,DIMENSION(2,1000000)::zeta_list
DOUBLE PRECISION,ALLOCATABLE,DIMENSION(:,:)::sub_zeta_list
INTEGER::i,myid,ntasks,ierror,n_elements,chunk,t1,t2,clock_rate,clock_max,required,provided


call MPI_INIT_THREAD(MPI_THREAD_SERIALIZED,provided,ierror)
call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierror)
call MPI_COMM_SIZE(MPI_COMM_WORLD, ntasks, ierror)

if (myid==0) then
	call system_clock ( t1, clock_rate, clock_max )
end if 

open(10,file='/home/h1352888/Fortran_Test/10000points_5000grid/M_list.bin',form='UNFORMATTED',access='STREAM')
open(9,file='/home/h1352888/Fortran_Test/10000points_5000grid/zeta_list.bin',form='UNFORMATTED',access='STREAM')
open(8,file='/home/h1352888/Fortran_Test/10000points_5000grid/X.bin',form='UNFORMATTED',access='STREAM')
open(7,file='/home/h1352888/Fortran_Test/10000points_5000grid/Y.bin',form='UNFORMATTED',access='STREAM')

read(10)M_list
read(9)zeta_list
read(8)X
read(7)Y


alpha_x = 0
alpha_y = 0 
sub_alpha_x = 0 
sub_alpha_y = 0 
z_m_z_x = 0
z_m_z_y = 0 
dist_z_m_z = 0 
n_elements = size(zeta_list,1)*size(zeta_list,2)
chunk = n_elements/ntasks
allocate(sub_zeta_list(2,chunk/2))
sub_zeta_list = 0 

!The final output arrays are alpha_x and alpha_y which are obtained from their sub_ counterparts via the loop.

call MPI_Scatter(zeta_list,chunk,MPI_DOUBLE_PRECISION,sub_zeta_list,chunk,MPI_DOUBLE_PRECISION,0,MPI_COMM_WORLD,ierror)
!$OMP PARALLEL DO SCHEDULE(GUIDED,50) private(z_m_z_x, z_m_z_y, dist_z_m_z) reduction( +: sub_alpha_x, sub_alpha_y )
do i = 1, chunk/2
    z_m_z_x = X - sub_zeta_list(1,i)
    z_m_z_y = Y - sub_zeta_list(2,i)
    dist_z_m_z = z_m_z_x**2 + z_m_z_y**2
    sub_alpha_x = sub_alpha_x + (M_list(1,i) * z_m_z_x / dist_z_m_z)
    sub_alpha_y = sub_alpha_y + (M_list(1,i) * z_m_z_y / dist_z_m_z)
end do
!$OMP END PARALLEL DO 

call MPI_Reduce(sub_alpha_x,alpha_x,400000000,MPI_DOUBLE_PRECISION,MPI_SUM,0,MPI_COMM_WORLD,ierror)    
call MPI_Reduce(sub_alpha_y,alpha_y,400000000,MPI_DOUBLE_PRECISION,MPI_SUM,0,MPI_COMM_WORLD,ierror)    

if (myid == 0) then
	write(*,*)'Number of processes is', ntasks
	write(*,*)chunk
	!write(*,*)alpha_x
	call system_clock ( t2, clock_rate, clock_max )
	write ( *, * ) 'Elapsed real time = ', real ( t2 - t1 ) / real ( clock_rate ) ,'seconds'
	open(12,file='/home/h1352888/Fortran_Test/10000points_5000grid/alpha_x.bin',form='UNFORMATTED',access='STREAM')
	write(12)alpha_x
	open(11,file='/home/h1352888/Fortran_Test/10000points_5000grid/alpha_y.bin',form='UNFORMATTED',access='STREAM')
	write(11)alpha_y
end if 

call MPI_FINALIZE(ierror)
stop
END PROGRAM TEST_MPI  
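
For reference, here is a minimal sketch of what I mean by the allocatable version (this is not my exact code; only the big grid arrays are shown, ngrid and istat are placeholder names, and the read/scatter/loop parts are elided):

PROGRAM ALLOC_SKETCH
IMPLICIT NONE
INTEGER, PARAMETER :: ngrid = 20000   ! illustrative grid size
DOUBLE PRECISION, ALLOCATABLE, DIMENSION(:,:) :: X, Y, alpha_x, alpha_y
INTEGER :: istat

! Allocate the large grids on the heap instead of declaring them with fixed sizes
allocate(X(ngrid,ngrid), Y(ngrid,ngrid), alpha_x(ngrid,ngrid), alpha_y(ngrid,ngrid), stat=istat)
if (istat /= 0) stop 'allocation failed'

! ... read X and Y, scatter zeta_list, run the loop and MPI_Reduce as in TEST_MPI above ...
! The same pattern would apply to z_m_z_x, z_m_z_y, dist_z_m_z, sub_alpha_x and sub_alpha_y.

deallocate(X, Y, alpha_x, alpha_y)
END PROGRAM ALLOC_SKETCH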

I compile with mpiifort -g -traceback -qopenmp -O3 -ipo -xHost -mt_mpi test.f90. I also know about the -heap-arrays flag (which slows the code down significantly, by 2x-3x) and use it sometimes, or I simply set OMP_STACKSIZE, although I know this isn't good for larger matrices. For example, if I run on 4 nodes with 20 cores each, I set 4 MPI processes and let OpenMP have 20 threads.
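
In case it helps, a minimal sketch of how the hybrid layout can be checked at run time (CHECK_HYBRID is just a placeholder program, not part of my code; it only prints the MPI thread level actually provided and the OpenMP thread count seen by each rank):

PROGRAM CHECK_HYBRID
use mpi
use omp_lib
IMPLICIT NONE
INTEGER :: myid, ntasks, ierror, provided

call MPI_INIT_THREAD(MPI_THREAD_SERIALIZED, provided, ierror)
call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierror)
call MPI_COMM_SIZE(MPI_COMM_WORLD, ntasks, ierror)

! Each rank reports its view of the hybrid layout
write(*,*) 'rank', myid, 'of', ntasks, 'provided thread level', provided, &
           'OpenMP max threads', omp_get_max_threads()

call MPI_FINALIZE(ierror)
END PROGRAM CHECK_HYBRID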

Thanks a lot

22 Replies
jimdempseyatthecove
Honored Contributor III
187 Views

JD>>thread iteration split points are at intervals of multiples of those arrays

Now that I have looked at my statement, I am not entirely sure it is correct. The unused cells for each thread should have carried through as 0's.

Jim Dempsey

jimdempseyatthecove
Honored Contributor III
187 Views

>>However collapse(3) gives me nonsensical results

This may also be a possible cause.

You are using !$OMP PARALLEL DO COLLAPSE(2) SCHEDULE(GUIDED,100) reduction( +: sub_alpha_x, sub_alpha_y)

Your arrays are double precision; the cache line width is 64 bytes == 8 doubles. A chunk size of 100 is not a multiple of the cache line size.

Try collapse(3) with a chunk size of 128, 120, 112, 104, or 96.

Note: making the chunk size a multiple of the number of elements in a cache line is merely a means of understanding the cause of the nonsensical results, not a suggestion that schedule(guided,104) yields the best performance.

Without a chunk size, you may find that

!$OMP PARALLEL DO SIMD COLLAPSE(3) SCHEDULE(STATIC) reduction( +: sub_alpha_x, sub_alpha_y)

will give you a good starting point.

ompChunk = ((n_elements/ntasks/2)*size(x) + 7) / nThreads ! or +15 when using real(4)
ompChunk = MAX(ompChunk - mod(ompChunk, 8), yourDeterminedBestSmallestChunkSize) ! or 16 when using real(4)
!$OMP PARALLEL DO SIMD COLLAPSE(3) SCHEDULE(DYNAMIC,ompChunk) reduction( +: sub_alpha_x, sub_alpha_y)

The above may be better, and other choices may be better still. You would be best advised to programmatically determine the best method for your typical production systems, using representative test data.
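
For illustration only, here is a compilable sketch of that idea: the whole-array operations from the original loop rewritten as a perfectly nested triple loop so that collapse(3) applies, with the chunk size computed along the lines above and rounded to a multiple of 8 doubles. The subroutine name compute_alpha and the names npts, ngrid, nIters and ompChunk are placeholders, not code from this thread, and the numbers are only a starting point.

subroutine compute_alpha(npts, ngrid, X, Y, M_list, sub_zeta_list, sub_alpha_x, sub_alpha_y)
   use omp_lib
   implicit none
   integer, intent(in) :: npts, ngrid
   double precision, intent(in)  :: X(ngrid,ngrid), Y(ngrid,ngrid)
   double precision, intent(in)  :: M_list(1,*), sub_zeta_list(2,*)
   double precision, intent(out) :: sub_alpha_x(ngrid,ngrid), sub_alpha_y(ngrid,ngrid)
   double precision :: dx, dy, r2
   integer :: i, j, k, ompChunk
   integer(kind=8) :: nIters

   sub_alpha_x = 0d0
   sub_alpha_y = 0d0

   ! Chunk ~ one thread's share of the collapsed iteration space, rounded down to a
   ! multiple of 8 (8 doubles per 64-byte cache line). The collapsed count can exceed
   ! 32-bit range for large grids, so the arithmetic is done in 64-bit and capped.
   nIters   = int(npts, 8) * int(ngrid, 8) * int(ngrid, 8)
   ompChunk = int( min( nIters / omp_get_max_threads(), 100000000_8 ) )
   ompChunk = max(ompChunk - mod(ompChunk, 8), 8)

!$OMP PARALLEL DO COLLAPSE(3) SCHEDULE(DYNAMIC,ompChunk) &
!$OMP    private(dx, dy, r2) reduction( +: sub_alpha_x, sub_alpha_y )
   do i = 1, npts
      do k = 1, ngrid
         do j = 1, ngrid
            dx = X(j,k) - sub_zeta_list(1,i)
            dy = Y(j,k) - sub_zeta_list(2,i)
            r2 = dx*dx + dy*dy
            sub_alpha_x(j,k) = sub_alpha_x(j,k) + M_list(1,i)*dx/r2
            sub_alpha_y(j,k) = sub_alpha_y(j,k) + M_list(1,i)*dy/r2
         end do
      end do
   end do
!$OMP END PARALLEL DO
end subroutine compute_alpha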

Jim Dempsey
