Hello everyone, I'm working on some astrophysical simulations and I've written a Fortran code. I have implemented MPI and OpenMP to create a hybrid code. The code works fine, with -O3 optimisation, for small-to-medium array sizes. However, when I try to increase the size of the matrices to obtain higher resolution, I run out of memory on the HPC cluster nodes (96 GB per node, to be exact). I've tried running on 4 nodes with 20 cores each. Is there any way I can run my code with larger matrix sizes without such a large memory requirement? I've read about using allocatable arrays and I've tried them, but the memory reduction is less than 1 %, so perhaps I'm allocating and deallocating them wrong. Basically, I read four external input arrays (X, Y, zeta_list, M_list), and the only major calculation in my code happens inside a do loop that iterates over coordinates (in the array zeta_list), so I use MPI_Scatter to break these up into chunks. I've shown an example code below so it might be easier to understand what's going on. Would appreciate any help on this!
PROGRAM TEST_MPI
use mpi
IMPLICIT NONE
DOUBLE PRECISION,DIMENSION(1,1000000)::M_list
DOUBLE PRECISION,DIMENSION(20000,20000)::X,Y,z_m_z_x,z_m_z_y,dist_z_m_z,alpha_x,alpha_y,sub_alpha_x,sub_alpha_y
DOUBLE PRECISION,DIMENSION(2,1000000)::zeta_list
DOUBLE PRECISION,ALLOCATABLE,DIMENSION(:,:)::sub_zeta_list
INTEGER::i,myid,ntasks,ierror,n_elements,chunk,t1,t2,clock_rate,clock_max,required,provided

call MPI_INIT_THREAD(MPI_THREAD_SERIALIZED,provided,ierror)
call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierror)
call MPI_COMM_SIZE(MPI_COMM_WORLD, ntasks, ierror)

if (myid==0) then
    call system_clock ( t1, clock_rate, clock_max )
end if

open(10,file='/home/h1352888/Fortran_Test/10000points_5000grid/M_list.bin',form='UNFORMATTED',access='STREAM')
open(9,file='/home/h1352888/Fortran_Test/10000points_5000grid/zeta_list.bin',form='UNFORMATTED',access='STREAM')
open(8,file='/home/h1352888/Fortran_Test/10000points_5000grid/X.bin',form='UNFORMATTED',access='STREAM')
open(7,file='/home/h1352888/Fortran_Test/10000points_5000grid/Y.bin',form='UNFORMATTED',access='STREAM')
read(10)M_list
read(9)zeta_list
read(8)X
read(7)Y

alpha_x = 0
alpha_y = 0
sub_alpha_x = 0
sub_alpha_y = 0
z_m_z_x = 0
z_m_z_y = 0
dist_z_m_z = 0

n_elements = size(zeta_list,1)*size(zeta_list,2)
chunk = n_elements/ntasks
allocate(sub_zeta_list(2,chunk/2))
sub_zeta_list = 0

!The final output arrays are alpha_x and alpha_y which are obtained from their sub_ counterparts via the loop.
call MPI_Scatter(zeta_list,chunk,MPI_DOUBLE_PRECISION,sub_zeta_list,chunk,MPI_DOUBLE_PRECISION,0,MPI_COMM_WORLD,ierror)

!$OMP PARALLEL DO SCHEDULE(GUIDED,50) private(z_m_z_x, z_m_z_y, dist_z_m_z) reduction( +: sub_alpha_x, sub_alpha_y )
do i=1,chunk/2,1
    z_m_z_x = X - sub_zeta_list(1,i)
    z_m_z_y = Y - sub_zeta_list(2,i)
    dist_z_m_z = z_m_z_x**2 + z_m_z_y**2
    sub_alpha_x = sub_alpha_x + (M_list(1,i)* z_m_z_x / dist_z_m_z)
    sub_alpha_y = sub_alpha_y + (M_list(1,i)* z_m_z_y / dist_z_m_z)
end do
!$OMP END PARALLEL DO

call MPI_Reduce(sub_alpha_x,alpha_x,400000000,MPI_DOUBLE_PRECISION,MPI_SUM,0,MPI_COMM_WORLD,ierror)
call MPI_Reduce(sub_alpha_y,alpha_y,400000000,MPI_DOUBLE_PRECISION,MPI_SUM,0,MPI_COMM_WORLD,ierror)

if (myid == 0) then
    write(*,*)'Number of processes is', ntasks
    write(*,*)chunk
    !write(*,*)alpha_x
    call system_clock ( t2, clock_rate, clock_max )
    write ( *, * ) 'Elapsed real time = ', real ( t2 - t1 ) / real ( clock_rate ) ,'seconds'
    open(12,file='/home/h1352888/Fortran_Test/10000points_5000grid/alpha_x.bin',form='UNFORMATTED',access='STREAM')
    write(12)alpha_x
    open(11,file='/home/h1352888/Fortran_Test/10000points_5000grid/alpha_y.bin',form='UNFORMATTED',access='STREAM')
    write(11)alpha_y
end if

call MPI_FINALIZE(ierror)
stop
END PROGRAM TEST_MPI
I compile with mpiifort -g -traceback -qopenmp -O3 -ipo -xHost -mt_mpi test.f90. I also know about the -heap-arrays flag (which slows the code down significantly, 2x-3x) and use it sometimes, or I simply set OMP_STACKSIZE, although I know this isn't a good approach for larger matrices. For example, if I run on 4 nodes with 20 cores each, I use 4 MPI processes and let OpenMP have 20 threads per process.
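To be concrete about what I mean by allocatable arrays, this is roughly the pattern I tried: only the declarations and allocations differ from the full code above, the shapes are the same, and the rest of the program is unchanged, so I may simply be missing the point of doing it this way.

DOUBLE PRECISION,ALLOCATABLE,DIMENSION(:,:)::M_list,zeta_list,X,Y
DOUBLE PRECISION,ALLOCATABLE,DIMENSION(:,:)::z_m_z_x,z_m_z_y,dist_z_m_z
DOUBLE PRECISION,ALLOCATABLE,DIMENSION(:,:)::alpha_x,alpha_y,sub_alpha_x,sub_alpha_y

! allocated after MPI_INIT_THREAD and before the reads
allocate(M_list(1,1000000), zeta_list(2,1000000))
allocate(X(20000,20000), Y(20000,20000))
allocate(z_m_z_x(20000,20000), z_m_z_y(20000,20000), dist_z_m_z(20000,20000))
allocate(sub_alpha_x(20000,20000), sub_alpha_y(20000,20000))
allocate(alpha_x(20000,20000), alpha_y(20000,20000))

! ... reads, MPI_Scatter, OpenMP loop and MPI_Reduce exactly as above ...

deallocate(z_m_z_x, z_m_z_y, dist_z_m_z)   ! freed once the loop is done
deallocate(X, Y, sub_alpha_x, sub_alpha_y)

If this is what the allocatable approach is supposed to look like, then I suspect I'm just moving the same nine 20000x20000 arrays from static storage to the heap, which would explain why the total footprint barely changes.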
Thanks a lot
JD>>thread iteration split points are at intervals of multiples of those arrays
Now that I have looked at my statement, I am not entirely sure it is correct. The unused cells for each thread should have carried through as 0's.
Jim Dempsey
>>However collapse(3) gives me nonsensical results
This may be another possible cause.
You are using !$OMP PARALLEL DO COLLAPSE(2) SCHEDULE(GUIDED,100) reduction( +: sub_alpha_x, sub_alpha_y)
Your arrays are double precision; the cache line width is 64 bytes == 8 doubles. A chunk size of 100 is not a multiple of the cache line size.
Try collapse(3) with a chunk size of 128, 120, 112, 104, or 96.
Note that making the chunk size a multiple of the number of elements in a cache line is merely a means of understanding the cause behind the nonsensical results, not a suggestion that the best performance will be obtained with schedule(guided,104).
Without a chunk size, you may find that using
!$OMP PARALLEL DO SIMD COLLAPSE(3) SCHEDULE(STATIC) reduction( +: sub_alpha_x, sub_alpha_y)
will give you a good starting point.
ompChunk = ((n_elements/ntasks/2)*size(x) + 7) / nThreads ! or +15 when using real(4)
ompChunk = MAX(ompChunk - mod(ompChunk, 8), yourDeterminedBestSmallestChunkSize) ! or 16 when using real(4)
!$OMP PARALLEL DO SIMD COLLAPSE(3) SCHEDULE(DYNAMIC,ompChunk) reduction( +: sub_alpha_x, sub_alpha_y)
The above may be better, though other choices can be better still. You would be best advised to programmatically determine the best method for your typical production systems, using representative test data.
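For illustration only: COLLAPSE(3) needs three perfectly nested DO loops, so the whole-array operations in the original loop body would have to be written out element by element. A rough sketch of that structure follows, with the chunk-size arithmetic above and the SIMD clause left out for brevity; nGrid, zx, zy and d are made-up names, and the scalar temporaries stand in for the full-size z_m_z_* work arrays.

integer :: j, k, nGrid
double precision :: zx, zy, d          ! scalar temporaries instead of 20000x20000 work arrays

nGrid = size(X,1)

!$OMP PARALLEL DO COLLAPSE(3) SCHEDULE(STATIC) PRIVATE(zx,zy,d) REDUCTION(+: sub_alpha_x, sub_alpha_y)
do i = 1, chunk/2
    do k = 1, nGrid
        do j = 1, nGrid
            zx = X(j,k) - sub_zeta_list(1,i)
            zy = Y(j,k) - sub_zeta_list(2,i)
            d  = zx*zx + zy*zy
            sub_alpha_x(j,k) = sub_alpha_x(j,k) + M_list(1,i)*zx/d
            sub_alpha_y(j,k) = sub_alpha_y(j,k) + M_list(1,i)*zy/d
        end do
    end do
end do
!$OMP END PARALLEL DO

Note that the REDUCTION clause still gives every thread its own private copy of sub_alpha_x and sub_alpha_y, exactly as in the whole-array version, so this changes how the iteration space is carved up, not the per-thread memory footprint.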
Jim Dempsey