Intel® Fortran Compiler

Question regarding weird bug with OpenMP parallelization


First I have to mention that I am new to the Intel Fortran Compiler, so there may be some obvious things that I am not doing or taking into account.


I recently switched to Intel Fortran for the possibility of parallelisation. I am now using OpenMP to run in parallel many of the loops in my program where that seemed to make sense. What I noticed is that the program got faster and faster with more threads until around 15 threads, and after that it started getting slower and slower (I have access to a server with 190 threads).

While looking for the issue I found this weird behaviour, which I could replicate with a test program (copy below). When I have an empty parallel region inside a loop together with a large array assignment (below it is U2 = U, each 128x128x50x2), the program slows down as I use more and more threads.
This happens even when the arrays U and U2 contain only zeros and the parallel region is completely empty as well.
If I deactivate the parallel region or delete the array assignment, it runs as fast as usual. If I put the array assignment and the empty parallel region in two different loops, it also runs fast.

So maybe there is some sort of memory problem or false sharing or something? I could not find any answers as to what is happening here, which is why I was hoping some of you could help me out. It would be much appreciated. Thanks!



program Test_Parallel2
USE omp_lib
INTEGER, parameter :: jmax=128, lmax=128, Nbr_S=1, Nbr_SSS=50
REAL (KIND(0.0)), DIMENSION(100) :: Simulation_Time
REAL (KIND(0.0)), DIMENSION(Nbr_SSS,2,jmax,lmax) :: U
    INTEGER :: Time1, Time2, rate
    INTEGER :: iMaxThreads, kmax, k
    REAL (KIND(0.0)) :: total_time
    REAL (KIND(0.0)), DIMENSION(Nbr_SSS,2,jmax,lmax) :: U2
    REAL (KIND(0.0)), DIMENSION(Nbr_SSS,2,jmax,lmax) :: Var1
    !Initial definitions
    kmax = 50
    Simulation_Time = 0.
    U = 0.
    !$ iMaxThreads = OMP_GET_MAX_THREADS()   
    !$ iMaxThreads = 50          !I override the number of active threads here for test purposes
    !$ call    OMP_SET_NUM_THREADS(iMaxThreads-1) 
    !$ Print *, "OpenMP active", iMaxThreads
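    CALL system_clock(count_rate=rate)
    CALL system_clock(count=Time1)        ! start of the timed region
    DO k=1, kmax, 1
        U2(:,:,:,:) = U(:,:,:,:)
        !$OMP PARALLEL 
        !$OMP END PARALLEL
    END DO
    CALL system_clock(count=Time2)        ! end of the timed region
    total_time = real(Time2-Time1)/real(rate)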
  write(*,*) 'Time taken (s):', total_time
end program Test_Parallel2



2 Replies

Maybe some information on the compiler properties that I am using:

/nologo /MP /O1 /assume:buffered_io /heap-arrays0 /Qopenmp /fp:strict /module:"x64\Release\\" /object:"x64\Release\\" /Fd"x64\Release\vc170.pdb" /libs:static /threads /Qmkl:sequential /c

Also, in the Linker->System configuration I set the Stack Reserve Size pretty high at around 500000000 - otherwise I get an overflow error.

Honored Contributor III

Running through an empty (or do-nothing) parallel region (that is not removed/elided by compiler optimizations) will incur the overhead of starting or resuming the thread team of that parallel region. More threads, more overhead. It is only when the parallel region has sufficient parallelizable code to surpass the overhead that parallelization becomes effective.
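A rough way to see this overhead is to time an empty parallel region for several team sizes. The sketch below is only an illustration: the program name, the repetition count nrep and the list of thread counts are made up for the example, and an optimizing compiler may remove the empty region entirely, in which case the numbers say nothing.

```fortran
program parallel_region_overhead
    use omp_lib
    implicit none
    integer, parameter :: nrep = 1000                 ! hypothetical repetition count
    integer, parameter :: counts(4) = [1, 2, 4, 8]    ! hypothetical team sizes to try
    integer :: i, k
    double precision :: t0, t1

    do i = 1, size(counts)
        call omp_set_num_threads(counts(i))

        ! Warm-up region so that first-time thread creation is not timed.
        !$omp parallel
        !$omp end parallel

        t0 = omp_get_wtime()
        do k = 1, nrep
            ! Empty region: only the cost of waking and joining the team remains.
            !$omp parallel
            !$omp end parallel
        end do
        t1 = omp_get_wtime()

        print '(a,i4,a,es10.3,a)', 'threads =', counts(i), &
              '  average cost per region =', (t1 - t0) / nrep, ' s'
    end do
end program parallel_region_overhead
```

Build it with OpenMP enabled (for example /Qopenmp, as in the options posted above). If the per-region cost grows with the team size, a loop that enters a parallel region on every iteration pays that cost every time.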

Note that the parallel region in your actual project may contain serialized (critical) sections, which will interfere with parallelization for that section of code. Some system functions have critical sections, for example random number generators and memory allocation/deallocation. Also note that if "Reallocation of lefthand side" results in a reallocation (memory deallocation and allocation), those statements will serialize.

Without seeing your real code, I will make an assumption:

    REAL (KIND(0.0)), DIMENSION(Nbr_SSS,2,jmax,lmax) :: U2
    REAL (KIND(0.0)), DIMENSION(Nbr_SSS,2,jmax,lmax) :: Var1

that your intention is to have each thread work on an individual Nbr_SSS section of the arrays.

Please note that, as you have dimensioned the arrays, this will be very inefficient with cache line usage.

Rework your code to use:

    REAL (KIND(0.0)), DIMENSION(2,jmax,lmax, Nbr_SSS) :: U2
    REAL (KIND(0.0)), DIMENSION(2,jmax,lmax, Nbr_SSS) :: Var1

or possibly:

    REAL (KIND(0.0)), DIMENSION(jmax,lmax, 2, Nbr_SSS) :: U2
    REAL (KIND(0.0)), DIMENSION(jmax,lmax, 2, Nbr_SSS) :: Var1

Be aware that the iterations on jmax are on adjacent locations in memory.
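As a minimal sketch of what that layout allows, assuming (as above) that each thread is meant to work on its own Nbr_SSS section: with Nbr_SSS as the last dimension, every section U2(:,:,:,s) is one contiguous block. The program name and the fill values are made up for the example, and the arrays are allocatable only to keep them off the stack.

```fortran
program slice_per_thread
    implicit none
    integer, parameter :: jmax = 128, lmax = 128, Nbr_SSS = 50
    ! Allocatable arrays live on the heap, so no large stack reserve is needed.
    real, allocatable :: U(:,:,:,:), U2(:,:,:,:)
    integer :: s

    allocate(U(jmax, lmax, 2, Nbr_SSS), U2(jmax, lmax, 2, Nbr_SSS))
    U  = 1.0      ! hypothetical fill values, just for the check at the end
    U2 = 0.0

    ! Each iteration copies one contiguous block of jmax*lmax*2 elements,
    ! so different threads write to separate regions of memory.
    !$omp parallel do
    do s = 1, Nbr_SSS
        U2(:,:,:,s) = U(:,:,:,s)
    end do
    !$omp end parallel do

    if (any(U2 /= U)) error stop 'copy mismatch'
    print *, 'sections copied:', Nbr_SSS
end program slice_per_thread
```

The assignment targets an array section, so it does not trigger a reallocation of the left-hand side. Whether this beats a plain serial U2 = U depends on how much work each section really carries in the actual code.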

The Fortran dimension order for adjacent memory locations is the inverse of that of C/C++: in Fortran the leftmost index varies fastest.
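A small illustration of that storage order, using a made-up 2x3 array whose elements encode their own indices as 10*i + j. Flattening with reshape walks the array in storage order, so the leftmost index changes first.

```fortran
program column_major_demo
    implicit none
    integer :: a(2,3), flat(6)
    integer :: i, j

    do j = 1, 3          ! rightmost index in the outer loop
        do i = 1, 2      ! leftmost index in the inner loop: adjacent in memory
            a(i,j) = 10*i + j
        end do
    end do

    ! reshape to rank 1 takes the elements in array element (storage) order.
    flat = reshape(a, [6])
    print '(6i4)', flat      ! prints:   11  21  12  22  13  23

    if (any(flat /= [11, 21, 12, 22, 13, 23])) error stop 'unexpected storage order'
end program column_major_demo
```

The equivalent C array a[2][3] would be stored as 11 12 13 21 22 23, which is why loop nests ported from C usually need their loop order swapped.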


Jim Dempsey