Intel® Fortran Compiler
Build applications that can scale for the future with optimized code designed for Intel® Xeon® and compatible processors.

OpenMP PARALLEL DO not working correctly

Serrano__Antonio
Beginner
Hello: I am using ifort (IFORT) 17.0.4 20170411 and trying to parallelize a DO loop using OpenMP, compiling with -qopenmp. The code is large, and I suspect I would easily overlook important parts if I tried to write a minimal example.

When I use !$OMP PARALLELDO PRIVATE (...) ... !$OMP END PARALLELDO, only one thread is created and it executes the whole loop.

When I use !$OMP PARALLEL ... !$OMP END PARALLEL with !$OMP DO ... !$OMP END DO inside it, all 20 threads are created and all of them are working, but each thread runs the whole loop. That is, the variable that serves as the counter of the DO loop takes all of its values in every thread, so the complete loop is run 20 times, once per thread. It seems that the !$OMP DO is being completely ignored, but there are no warnings from the compiler.

I use allocatable arrays that are allocated before the parallelized loop and are declared in the PRIVATE clause. Could this have something to do with it?

Here is some output for the case in which I use !$OMP PARALLEL ... !$OMP END PARALLEL and, inside it, !$OMP DO ... !$OMP END DO:

Number of requested cores: 20
Thread_000 Problem_day_0000001
Thread_001 Problem_day_0000001
Thread_003 Problem_day_0000001

The loop counter is "Problem_day". When I omit the !$OMP DO ... !$OMP END DO and partition the loop myself, the work is distributed among the threads and there seems to be no problem.
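For reference, a minimal sketch of the structure being described (the array and loop names here are hypothetical, not the actual code; compile with ifort -qopenmp). If the directives take effect, the 365 iterations should be divided among the threads, with each thread reporting a different range of Problem_day values:

program omp_loop_sketch
    use omp_lib
    implicit none
    integer :: problem_day
    real, allocatable :: work(:)

    allocate(work(1000))                   ! allocated before the parallel loop
    print *, 'Number of requested cores: ', omp_get_max_threads()

!$OMP PARALLEL DO PRIVATE(work)
    do problem_day = 1, 365                ! the loop index is private automatically
        work = real(problem_day)           ! each thread works on its own allocated copy
        write(*,'(A,I3.3,A,I7.7)') 'Thread_', omp_get_thread_num(), &
            ' Problem_day_', problem_day
    end do
!$OMP END PARALLEL DO

    deallocate(work)
end program omp_loop_sketch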
philliard
Beginner

John Campbell wrote:

In Quote #11, you could try the following changes, since localCount is shared (or use !$OMP atomic?):

 !$OMP PARALLEL DO default (shared) private (i, id, percentToDo) reduction (+ : localCount)
    do i = 1, 10
        id = OMP_GET_THREAD_NUM()
        localCount = localCount + 1
        percentToDo = (sin(req%Aii(1)) + cos(req%Aii(2))) + id
        print *, i, percentToDo, id, OMP_GET_MAX_THREADS(), OMP_IN_PARALLEL()
    end do
 !$OMP END PARALLEL DO
    print *, localCount    
    print *, i, percentToDo, id, OMP_GET_MAX_THREADS(), OMP_IN_PARALLEL(),' is i=11 or 0 ?'

 

Thanks John, but the value of that localCount variable is not really material; I only added it to see how many times the loop was executing, and the behavior is the same with or without it. The problem I am having is that the OMP DO seems to be ignored altogether. If I create a parallel region with OMP PARALLEL followed by OMP DO, each thread executes the whole loop, meaning 8 * 10 executions; and if I use PARALLEL DO, the entire loop executes on 1 thread.

John_Campbell
New Contributor II

Philliard,

You indicated in post #11 that the !$OMP region was remaining single-threaded, but I have seen no indication of why this happened. If you print the value of i after the DO, it will be 11 if OMP was not active, or undefined if it was, since the loop index is private. The compiler can give poor (or no) warnings when !$OMP directives are ignored. I prefer explicit private and shared clauses, to at least review how arrays are being used and replicated.
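A minimal sketch of that diagnostic (assuming a 10-iteration loop, as in the earlier snippet, and initializing i so the print after the loop is well defined):

program omp_active_check
    implicit none
    integer :: i

    i = 0
!$OMP PARALLEL DO private(i)
    do i = 1, 10
        continue                ! stand-in for the real loop body
    end do
!$OMP END PARALLEL DO

    ! 11 => the directives were ignored (serial semantics);
    ! 0  => OMP was active and each thread used a private copy of i
    print *, 'i after loop = ', i
end program omp_active_check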

You say that OMP is not working by comparing 2017 to 2019 performance, but the most significant change is in the non-OMP time (5.5 sec to 59 sec). What is the reason for this? You need an explanation (different test, bigger problem, reduced caching?). It is a big change.

The other issue is OMP efficiency (8 threads: 32 sec vs 1 thread: 59 sec); that is not good scaling. Make sure you are comparing the performance of the OMP region itself. Then look at the possible sources of inefficiency: count the number of OMP region entries (at roughly 2.e-5 sec per entry), check for memory clashes when updating shared arrays, and ask whether memory transfer demands have increased.
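One way to time just the region, sketched here with a hypothetical stand-in workload, is to bracket it with omp_get_wtime():

program time_omp_region
    use omp_lib
    implicit none
    integer :: i
    double precision :: t0, t1, s

    s = 0.0d0
    t0 = omp_get_wtime()
!$OMP PARALLEL DO default(shared) private(i) reduction(+:s)
    do i = 1, 10000000
        s = s + sin(dble(i))    ! stand-in for the real work
    end do
!$OMP END PARALLEL DO
    t1 = omp_get_wtime()

    print *, 'OMP region wall time (sec): ', t1 - t0
    print *, 'checksum: ', s    ! keeps the loop from being optimized away
end program time_omp_region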

philliard
Beginner

Sorry, I mislabeled the test cases: the compiler was 2020 (update 1), not 2019.

The code and the test problem I am running are identical between the 2017 and 2020 versions; the only difference between the test cases is the compiler used to build the code. With the 2017-compiled version, no matter how many cores I request, I get the same execution time, and I can tell by watching Task Manager that it is only using one thread.

I know that I am looking at timing for the whole code and not just the parallel region, so I know there are a lot of inefficiencies in the multi-threading result. However, my biggest concern right now is why the 2020-compiled version is so much slower. My code is the same and the test case is the same, so something has changed in the compiler, and apparently something in my code, combined with those changes, has caused the code to slow way down.

jimdempseyatthecove
Honored Contributor III

Since the timing varies between versions by approximately 10x, this is indicative of potentially several things:

1) One compilation is using array bounds checking and the other is not,
2) You have sensitive versus insensitive convergence code, where one version converges in many fewer iterations than the other,
3) Your 8 threads are running on one hardware thread in one version, and on all hardware threads in the other.

Note for 3): You state that your code is a DLL that is using OpenMP. If the executable that calls your DLL sets the process affinity to use 1 logical processor (1 hardware thread), and your DLL code instructs OpenMP to use 8 threads, then those 8 threads will all run on that single logical processor, as the sketch below can help confirm.
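A quick way to check from inside the DLL (the routine name here is hypothetical): if the calling process has restricted the affinity mask, the OpenMP runtime will typically report correspondingly few available processors.

subroutine report_omp_environment()
    use omp_lib
    implicit none
    print *, 'omp_get_num_procs()   = ', omp_get_num_procs()
    print *, 'omp_get_max_threads() = ', omp_get_max_threads()
!$OMP PARALLEL
!$OMP SINGLE
    print *, 'threads in region     = ', omp_get_num_threads()
!$OMP END SINGLE
!$OMP END PARALLEL
end subroutine report_omp_environment

If omp_get_num_procs() reports 1 while 8 threads are being requested, the affinity mask of the calling process is the likely culprit.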

Jim Dempsey

jimdempseyatthecove
Honored Contributor III


Make sure you also have use omp_lib:

program YourProgram
    use omp_lib
    ....
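Without use omp_lib (or an explicit interface), implicit typing can silently treat a function such as OMP_GET_THREAD_NUM as default REAL, producing nonsense values. A minimal self-check, assuming nothing beyond the standard module:

program check_omp_lib
    use omp_lib
    implicit none
    print *, 'max threads = ', omp_get_max_threads()
end program check_omp_lib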

Jim Dempsey
