Hello,
First, I should mention that I am new to the Intel Fortran Compiler, so there may be some obvious things that I am not doing or taking into account.
I recently switched to Intel Fortran for the possibility of parallelisation. I am now using OpenMP to run in parallel many of the loops in my program where that seemed to make sense. What I noticed is that the performance improved with more threads up to around 15 threads, and after that it got slower and slower (I have access to a server with 190 threads).
While looking for the cause, I found this odd behaviour, which I could replicate with a test program (copy below). When I have an empty parallel region inside a loop together with a larger array definition (below it is U2 = U = 128x128x50x2), the program slows down if I use more and more loops.
This happens even when the arrays U and U2 are empty and the parallel region is completely empty as well.
If I deactivate the parallel region or delete the array definition, it runs as fast as usual. If I put the array definition and the empty parallel region in two different loops, it also runs fast.
So maybe there is some sort of memory problem, or false sharing, or something similar? I could not find any explanation of what is happening here, which is why I was hoping some of you could help me out. It would be much appreciated. Thanks!
---------------------
---------------------
Here is some information on the compiler options that I am using:
/nologo /MP /O1 /assume:buffered_io /heap-arrays0 /Qopenmp /fp:strict /module:"x64\Release\\" /object:"x64\Release\\" /Fd"x64\Release\vc170.pdb" /libs:static /threads /Qmkl:sequential /c
Also, in the Linker->System configuration I set the Stack Reserve Size quite high, at around 500000000; otherwise I get an overflow error.
Running through an empty (or do-nothing) parallel region (one that is not removed/elided by compiler optimizations) will incur the overhead of starting or resuming the thread team of that parallel region. More threads, more overhead. It is only when the parallel region has sufficient parallelizable code to surpass the overhead that parallelization becomes effective.
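To see this overhead in isolation, a minimal sketch along the following lines could be used. It is not the original test program: the program name and the repetition count n_iter are made up for illustration, and an optimizing compiler may elide the empty region, in which case the measured cost would be close to zero.

```fortran
program parallel_overhead
  use omp_lib
  implicit none
  integer, parameter :: n_iter = 100000   ! hypothetical repetition count
  integer :: i, nthreads, max_threads
  double precision :: t0, t1

  max_threads = omp_get_max_threads()
  nthreads = 1
  ! Double the team size each pass: 1, 2, 4, ... up to the available maximum
  do while (nthreads <= max_threads)
    call omp_set_num_threads(nthreads)
    t0 = omp_get_wtime()
    do i = 1, n_iter
      ! Empty region: any time spent here is fork/join overhead only
      !$omp parallel
      !$omp end parallel
    end do
    t1 = omp_get_wtime()
    print '(a,i4,a,es10.3,a)', 'threads = ', nthreads, &
          '  cost per region = ', (t1 - t0) / n_iter, ' s'
    nthreads = nthreads * 2
  end do
end program parallel_overhead
```

If the cost per region grows with the thread count, then a loop that enters a parallel region on every iteration pays that cost every time, regardless of how little work the region contains.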
Note that your parallel region (in your actual project) may contain serialized (critical) sections, which will interfere with parallelization for that section of code. Some system functions have critical sections. Examples: random number generators, memory allocation/deallocation. Also note that if "Reallocation of lefthand side" results in reallocation (memory deallocation and allocation), those statements will serialize.
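The reallocation point can be illustrated with a small sketch. It assumes that U and U2 are allocatable arrays, which the thread does not confirm; the program name is made up, and the sizes are taken from the 128x128x50x2 figure in the question. Whether the first form actually performs the check depends on the compiler's realloc-lhs setting.

```fortran
program lhs_realloc_sketch
  implicit none
  ! Assumption: allocatable arrays; the original declarations may differ
  real, allocatable :: u(:,:,:,:), u2(:,:,:,:)

  allocate(u(128,128,50,2), u2(128,128,50,2))
  u = 0.0

  ! Whole-array assignment to an allocatable variable: under Fortran 2003
  ! semantics the shapes are compared and u2 is reallocated if they differ.
  ! Here the shapes match, so only the check is performed.
  u2 = u

  ! Array-section assignment: u2 is never reallocated by this statement,
  ! so the shapes must already conform.
  u2(:,:,:,:) = u

  print *, 'shape(u2) = ', shape(u2)
end program lhs_realloc_sketch
```

The section form is one way to make sure no allocation or deallocation, and therefore no serialization inside the memory allocator, can be triggered by the assignment.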
Without seeing your real code, I will make an assumption:
REAL (KIND(0.0)), DIMENSION(Nbr_SSS,2,jmax,lmax) :: U2
REAL (KIND(0.0)), DIMENSION(Nbr_SSS,2,jmax,lmax) :: Var1
that your intention is to have each thread work on an individual Nbr_SSS section of the arrays.
Please note that, as you have dimensioned the arrays, this will be very inefficient with cache line usage.
Rework your code to use:
REAL (KIND(0.0)), DIMENSION(2,jmax,lmax, Nbr_SSS) :: U2
REAL (KIND(0.0)), DIMENSION(2,jmax,lmax, Nbr_SSS) :: Var1
or possibly:
REAL (KIND(0.0)), DIMENSION(jmax,lmax, 2, Nbr_SSS) :: U2
REAL (KIND(0.0)), DIMENSION(jmax,lmax, 2, Nbr_SSS) :: Var1
Be aware that iterations over jmax (the leftmost index) touch adjacent locations in memory.
The Fortran dimension order for adjacent memory locations is the inverse of that in C/C++.
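As a minimal sketch of how the second layout might be used, assuming each thread is meant to work on its own Nbr_SSS section: the sizes and the loop body below are hypothetical, chosen only to show the loop order.

```fortran
program layout_sketch
  implicit none
  ! Hypothetical sizes, for illustration only
  integer, parameter :: jmax = 128, lmax = 128, nbr_sss = 50
  real, allocatable :: u2(:,:,:,:), var1(:,:,:,:)
  integer :: j, l, k, s

  allocate(u2(jmax,lmax,2,nbr_sss), var1(jmax,lmax,2,nbr_sss))
  var1 = 1.0

  ! Parallelize over the rightmost index, so each thread owns a
  ! contiguous block of memory; keep the leftmost index innermost,
  ! so consecutive iterations touch adjacent array elements.
  !$omp parallel do private(j, l, k)
  do s = 1, nbr_sss
    do k = 1, 2
      do l = 1, lmax
        do j = 1, jmax
          u2(j,l,k,s) = 2.0 * var1(j,l,k,s)   ! placeholder computation
        end do
      end do
    end do
  end do
  !$omp end parallel do

  print *, 'sum(u2) = ', sum(u2)
end program layout_sketch
```

With Nbr_SSS as the leftmost dimension instead, elements belonging to different threads would be interleaved in memory, so several threads would end up writing to the same cache lines.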
Jim Dempsey
