Question regarding weird bug with OpenMP parallelization

Marius13 · 05-09-2023

Hello,
First I should mention that I am new to the Intel Fortran Compiler, so there may be some obvious things that I am not doing or taking into account.

I recently switched to Intel Fortran for the possibility of parallelisation. I am now using OpenMP to run in parallel many of the loops in my program where that seemed to make sense. What I noticed is that the program got faster and faster with more threads up to around 15 threads, and after that it got slower and slower (I have access to a server with 190 threads).

While looking for the issue I found this weird behaviour, which I could replicate with a test program (copied below). When I have an empty parallel region inside a loop together with a larger array assignment (below it is U2 = U, with 128x128x50x2 elements each), the program slows down as I use more and more threads.
This happens even when the arrays U and U2 are empty (all zeros) and the parallel region is completely empty as well.
If I deactivate the parallel region or delete the array assignment, it runs as fast as usual. If I put the array assignment and the empty parallel region into two different loops, it also runs fast.

So maybe this is some sort of memory problem, or false sharing, or something similar? I could not find any answers as to what is happening here, which is why I was hoping some of you could help me out. It would be much appreciated. Thanks!

---------------------

program Test_Parallel2

    USE omp_lib

    IMPLICIT NONE

    INTEGER, parameter :: jmax=128, lmax=128, Nbr_S=1, Nbr_SSS=50
    REAL (KIND(0.0)), DIMENSION(100) :: Simulation_Time
    REAL (KIND(0.0)), DIMENSION(Nbr_SSS,2,jmax,lmax) :: U
    INTEGER :: Time1, Time2, rate
    INTEGER :: iMaxThreads, kmax, k
    REAL (KIND(0.0)) :: total_time
    REAL (KIND(0.0)), DIMENSION(Nbr_SSS,2,jmax,lmax) :: U2
    REAL (KIND(0.0)), DIMENSION(Nbr_SSS,2,jmax,lmax) :: Var1

    !Initial Definitions
    kmax = 50
    Simulation_Time = 0.
    U = 0.
    U2=0.

    !$ iMaxThreads = OMP_GET_MAX_THREADS()
    !$ iMaxThreads = 50 !I override the number of active threads here for test purposes
    !$ call OMP_SET_NUM_THREADS(iMaxThreads-1)
    !$ Print *, "OpenMP active", iMaxThreads

    CALL system_clock(count_rate=rate)
    CALL SYSTEM_CLOCK(Time1)

    DO k=1, kmax, 1
        !Whole-array copy, followed by an empty parallel region
        U2(:,:,:,:) = U(:,:,:,:)
        !$OMP PARALLEL
        !$OMP END PARALLEL
    END DO

    CALL SYSTEM_CLOCK(Time2)

    total_time = real(Time2-Time1)/real(rate)
    write(*,*) 'Time taken (s):', total_time

    PAUSE 10

end program Test_Parallel2

---------------------

Marius13 · 05-09-2023

Maybe some information on the compiler properties that I am using:

/nologo /MP /O1 /assume:buffered_io /heap-arrays0 /Qopenmp /fp:strict /module:"x64\Release\\" /object:"x64\Release\\" /Fd"x64\Release\vc170.pdb" /libs:static /threads /Qmkl:sequential /c

Also, in the Linker->System configuration I set the Stack Reserve Size pretty high, at around 500000000; otherwise I get an overflow error.

jimdempseyatthecove · 05-10-2023

Running through an empty (or do-nothing) parallel region (one that is not removed/elided by compiler optimizations) will incur the overhead of starting or resuming the thread team of that parallel region. More threads, more overhead. It is only when the parallel region has sufficient parallelizable code to surpass the overhead that parallelization becomes effective.
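The fork/join cost described above can be estimated directly. The following is a minimal sketch, not taken from the original program: the program name and the repetition count `nrep` are hypothetical, and it assumes the compiler does not optimize the empty region away. It times many empty parallel regions for a doubling number of threads and prints the average cost per region.

```fortran
program fork_join_overhead
    use omp_lib
    implicit none
    integer, parameter :: nrep = 10000   ! hypothetical repetition count
    integer :: nthreads, max_threads, i
    double precision :: t0, t1

    max_threads = omp_get_max_threads()
    nthreads = 1
    do while (nthreads <= max_threads)
        call omp_set_num_threads(nthreads)

        ! Warm-up region so that thread creation itself is not timed
        !$omp parallel
        !$omp end parallel

        t0 = omp_get_wtime()
        do i = 1, nrep
            ! Empty region: only the cost of waking and joining the team remains
            !$omp parallel
            !$omp end parallel
        end do
        t1 = omp_get_wtime()

        print '(a,i4,a,es10.3,a)', 'threads=', nthreads, &
              '  cost per region=', (t1 - t0) / nrep, ' s'
        nthreads = nthreads * 2
    end do
end program fork_join_overhead
```

If the measured cost per region grows with the thread count, a loop body needs correspondingly more parallel work per region before additional threads pay off.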

Note that your parallel region (in your actual project) may contain serialized (critical) sections, which will interfere with parallelization for that section of code. Some system functions have critical sections. Examples: random number generators, memory allocation/deallocation. Also note that should "Reallocation of left-hand side" result in reallocation (memory deallocation and allocation), those statements will serialize.
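To illustrate how a serialized section limits scaling, here is a small sketch under stated assumptions: the program name, the loop length `n`, and the summation itself are hypothetical and do not come from the original project. The first loop funnels every update through a critical section, so the threads take turns; the second computes the same sum with a reduction clause, which lets each thread accumulate privately.

```fortran
program critical_vs_reduction
    use omp_lib
    implicit none
    integer, parameter :: n = 1000000    ! hypothetical loop length
    integer :: i
    double precision :: s_crit, s_red, t0, t1, t2

    s_crit = 0.0d0
    t0 = omp_get_wtime()
    !$omp parallel do
    do i = 1, n
        ! Only one thread at a time may execute this update
        !$omp critical
        s_crit = s_crit + 1.0d0
        !$omp end critical
    end do
    !$omp end parallel do
    t1 = omp_get_wtime()

    s_red = 0.0d0
    !$omp parallel do reduction(+:s_red)
    do i = 1, n
        ! Each thread sums into its own private copy; copies are combined at the end
        s_red = s_red + 1.0d0
    end do
    !$omp end parallel do
    t2 = omp_get_wtime()

    print *, 'critical : sum =', s_crit, ' time (s) =', t1 - t0
    print *, 'reduction: sum =', s_red,  ' time (s) =', t2 - t1
end program critical_vs_reduction
```

Both loops produce the same sum; only the timing differs, and the gap typically widens as threads are added.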

Without seeing your real code, I will make an assumption:

    REAL (KIND(0.0)), DIMENSION(Nbr_SSS,2,jmax,lmax) :: U2
    REAL (KIND(0.0)), DIMENSION(Nbr_SSS,2,jmax,lmax) :: Var1

that your intention is to have each thread work on an individual Nbr_SSS section of the arrays.

Please note that, as you have dimensioned the arrays, this will be very inefficient with cache line usage.

Rework your code to use:

    REAL (KIND(0.0)), DIMENSION(2,jmax,lmax, Nbr_SSS) :: U2
    REAL (KIND(0.0)), DIMENSION(2,jmax,lmax, Nbr_SSS) :: Var1

or possibly:

    REAL (KIND(0.0)), DIMENSION(jmax,lmax, 2, Nbr_SSS) :: U2
    REAL (KIND(0.0)), DIMENSION(jmax,lmax, 2, Nbr_SSS) :: Var1

Be aware that the iterations over jmax are then on adjacent locations in memory.

The Fortran dimension order for adjacent memory locations is the inverse of that in C/C++.
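A minimal sketch of the second suggested layout, assuming each thread is meant to work on its own Nbr_SSS section: the program name, the loop body, and the use of allocatable arrays (chosen here to avoid large stack arrays) are illustrative assumptions, not the original code. The parallel loop runs over the last (slowest-varying) index, and the innermost loop runs over the first index so that consecutive iterations touch adjacent memory.

```fortran
program layout_sketch
    implicit none
    integer, parameter :: jmax = 128, lmax = 128, Nbr_SSS = 50
    real, allocatable :: U(:,:,:,:), U2(:,:,:,:)
    integer :: j, l, c, s

    ! Section index last: each section is one contiguous block of memory
    allocate(U(jmax, lmax, 2, Nbr_SSS), U2(jmax, lmax, 2, Nbr_SSS))
    U = 1.0

    ! Each thread receives whole sections, so threads do not share cache lines
    ! except possibly at section boundaries
    !$omp parallel do private(j, l, c)
    do s = 1, Nbr_SSS
        do c = 1, 2
            do l = 1, lmax
                do j = 1, jmax          ! first index varies fastest in memory
                    U2(j, l, c, s) = 2.0 * U(j, l, c, s)   ! placeholder work
                end do
            end do
        end do
    end do
    !$omp end parallel do

    print *, 'sum(U2) =', sum(U2)
end program layout_sketch
```

With the original ordering (Nbr_SSS as the first dimension), the elements of one section would be interleaved with those of every other section, so neighbouring threads would write into the same cache lines.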

Jim Dempsey