
OpenMP with PARALLEL DO not correctly working

Hello: I am using ifort (IFORT) 17.0.4 20170411 and am trying to parallelize a DO loop using OpenMP. I compile with -qopenmp. The code is large, and I suspect I could easily overlook important parts if I tried to write a minimal example.

When I use !$OMP PARALLELDO PRIVATE (...) ... !$OMP END PARALLELDO, only one thread is created and it runs the whole loop.

When I use !$OMP PARALLEL ... !$OMP END PARALLEL, and inside it !$OMP DO ... !$OMP END DO, all 20 threads are created and all of them are working, but each thread runs the whole loop; that is, the loop counter takes, for each thread, all of its values, so the complete loop is executed 20 times, once per thread. It seems the !$OMP DO is being completely ignored, yet there are no warnings from the compiler.

I use allocatable arrays that are allocated before the parallelized loop and are listed in the PRIVATE clause. Could that have something to do with it?

Here is some output for the case in which I use !$OMP PARALLEL ... !$OMP END PARALLEL with !$OMP DO ... !$OMP END DO inside it (the loop counter is "Problem_day"):

Number of requested cores: 20
Thread_000 Problem_day_0000001
Thread_001 Problem_day_0000001
Thread_003 Problem_day_0000001

When I omit the !$OMP DO ... !$OMP END DO and partition the loop myself, the work is distributed among the threads and there seems to be no problem.
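[Editor's note] For reference, a minimal sketch of the two forms being compared; the loop bounds, loop body, and names here are hypothetical, not the poster's actual code:

program two_forms
    use omp_lib
    implicit none
    integer :: Problem_day, NDays
    NDays = 365

    ! Form 1: combined directive; the loop index is implicitly private
    !$OMP PARALLEL DO
    do Problem_day = 1, NDays
        print *, 'Thread', OMP_GET_THREAD_NUM(), ' Problem_day', Problem_day
    end do
    !$OMP END PARALLEL DO

    ! Form 2: explicit parallel region with a work-sharing DO inside it
    !$OMP PARALLEL
    !$OMP DO
    do Problem_day = 1, NDays
        print *, 'Thread', OMP_GET_THREAD_NUM(), ' Problem_day', Problem_day
    end do
    !$OMP END DO
    !$OMP END PARALLEL
end program two_forms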

>>When I use !$OMP PARALLELDO PRIVATE (...) ... !$OMP END PARALLELDO, only one thread is created and it runs the whole loop.

Prior to !$OMP... are you already in a parallel region?

The behavior is as if you are, or have set the number of threads to 1.

Just prior to  !$OMP PARALLELDO PRIVATE (...) insert

PRINT *, OMP_IN_PARALLEL(), OMP_GET_MAX_THREADS()

You should see .FALSE. and 20

Jim Dempsey

Thanks for your answer, Jim. I have tested your suggestion and the result is as it should be:

F 20

And the whole loop is still being run 20 times, once per thread.

Can you copy and paste (using the {...} code button) the exact text of the !$OMP statements, inclusive of the Fortran DO statement?

Also, post your file name (in other words, is it xxx.f90, xxx.f, xxx.for, etc.), i.e. is the source compiled as fixed form or free form?

Jim Dempsey

Yes, Jim: The extension is .f08 (yes, I know it is not a good practice) and I compile with the options -free and -c -Tf file_name.f08. Regarding the exact OMP directives, here they are:

!$OMP PARALLEL PRIVATE (iProb, IdThread, NThreads, First, Last, Distances, Order, cObs, iStn, &
!$OMP                   iNonNanAnal, iAnal, OrderNonNan, &
!$OMP                   TandVector, i, TorsMatrix, &
!$OMP                   PredtandDtrSlope, PredtandDtrIntercept, &
!$OMP                   EstimationJulDay, &
!$OMP                   mi, ccm, kvars, corpar, coe, con, &
!$OMP                   Diff, NewDiff, intBestAnal)
!$OMP DO
...
!$OMP END DO
!$OMP END PARALLEL

I have read the following post, where they say that it is not enough to declare an allocatable array as private; I would also have to declare it as copyin. But the results of the parallelized code are the same as those of the non-parallelized code, so I think the allocatable private arrays are working well. The post is: https://software.intel.com/es-es/forums/intel-fortran-compiler-for-linux-and-mac-os-x/topic/371083
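[Editor's note] For what it is worth, COPYIN applies to THREADPRIVATE data, and a PRIVATE clause never copies values in (FIRSTPRIVATE does). One pattern that sidesteps the question is to allocate the work arrays inside the parallel region, so each thread manages its own copy. A minimal sketch with hypothetical names and sizes:

program private_workspace
    implicit none
    real, allocatable :: Distances(:)      ! hypothetical work array
    integer :: iProb

    !$OMP PARALLEL PRIVATE(iProb, Distances)
    allocate(Distances(1000))              ! each thread allocates its own copy
    !$OMP DO
    do iProb = 1, 20
        Distances = real(iProb)            ! per-thread work on the private array
    end do
    !$OMP END DO
    deallocate(Distances)
    !$OMP END PARALLEL
end program private_workspace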

Add private integer variables: myFirst, myLast

...
myFirst = 0
!$omp do
do i=... ! use other index if not i
  if(myFirst==0) myFirst=i !use other index if not i
  myLast=i
  ... ! remainder of loop
end do
!$omp end do
print *, omp_get_thread_nim(), omp_in_parallel(), myFirst, myLast
!$omp end parallel

Jim Dempsey

This is the output (I think you meant omp_get_thread_num). I use 20 threads:

10 T 1 1
15 T 1 1
6 T 1 1
9 T 1 1
7 T 1 1
1 T 1 1
8 T 1 1
16 T 1 1
11 T 1 1
14 T 1 1
12 T 1 1
19 T 1 1
2 T 1 1
13 T 1 1
5 T 1 1
3 T 1 1
0 T 1 1
17 T 1 1
4 T 1 1
18 T 1 1
7 T 1 2
16 T 1 2
19 T 1 2
11 T 1 2
10 T 1 2
6 T 1 2
2 T 1 2
14 T 1 2
17 T 1 2
4 T 1 2
0 T 1 2
3 T 1 2
12 T 1 2
5 T 1 2
1 T 1 2
13 T 1 2
9 T 1 2
8 T 1 2
15 T 1 2
18 T 1 2
7 T 1 3
16 T 1 3
11 T 1 3
6 T 1 3
10 T 1 3
2 T 1 3
14 T 1 3
19 T 1 3
17 T 1 3
4 T 1 3
3 T 1 3
5 T 1 3
0 T 1 3
12 T 1 3

I now have more information. When I compile with these options:

ifort -module intOz/obj/Release/ -fpp -check bounds -free -assume norealloc_lhs -qopenmp -xCORE-AVX2 -O2 -traceback -I../cdi/installed-intel-ozonosfera-1.6.2/include -I../FLAP/compiled-intel-ozonosfera/static/mod -c -Tf clasif_kmedias.f90 -o intOz/obj/Release/clasif-kemedias/clasif_kmedias.o

the "!$OMP parallel" works fine. But when I use the following:

ifort -module intOz/obj/Debug/ -fpp -check all -free -assume norealloc_lhs -qopenmp -traceback -warn all -debug full -check noarg_temp_created -heap-arrays -I../cdi/installed-intel-ozonosfera-1.6.2/include -I../FLAP/compiled-intel-ozonosfera/static/mod -c -Tf clasif_kmedias.f90 -o intOz/obj/Debug/clasif-kemedias/clasif_kmedias.o

the "!$OMP parallel" does not work.

>>10 T 1 1

What is the iteration count of your DO loop?

Is it less than the number of threads?

It appears that you have an outer loop (not shown) that sets the iteration count of the inner (parallel) loop to 1, then 2, then 3, ...

It should be OK, though inefficient, to have an iteration count less than the thread count.

Can you make, and post, a simple reproducer?

It is odd that the release version works but the debug version does not. Usually it is the other way around.

ifort -module intOz/obj/Release/ -fpp -check bounds -free -assume norealloc_lhs -qopenmp -xCORE-AVX2 -O2 -traceback                                                              -I../cdi/installed-intel-ozonosfera-1.6.2/include -I../FLAP/compiled-intel-ozonosfera/static/mod -c -Tf clasif_kmedias.f90 -o intOz/obj/Release/clasif-kemedias/clasif_kmedias.o
ifort -module intOz/obj/Debug/   -fpp -check all    -free -assume norealloc_lhs -qopenmp                 -traceback -warn all -debug full -check noarg_temp_created -heap-arrays -I../cdi/installed-intel-ozonosfera-1.6.2/include -I../FLAP/compiled-intel-ozonosfera/static/mod -c -Tf clasif_kmedias.f90 -o intOz/obj/Debug/clasif-kemedias/clasif_kmedias.o

In trying to isolate the issue (so that you can work around the problem), try changing the Debug build one option at a time to match the Release build; you may find the offending option.

A simple reproducer can be posted here and submitted to Intel.

Jim Dempsey

Was this ever resolved? I have a very similar problem with a code that I used to have running in parallel, but something has changed and I cannot figure out why it doesn't work any more. I simplified it down to this:

        print *, OMP_IN_PARALLEL(), OMP_GET_MAX_THREADS()
                        
        !$OMP PARALLEL
            print *, 'hello from thread:', OMP_GET_THREAD_NUM()
        !$OMP END PARALLEL
                            
        print *, OMP_IN_PARALLEL(), OMP_GET_MAX_THREADS()
                        
        req%Aii(1) = percentComplete
        req%Aii(2) = 100. - percentComplete
                            
        localCount = 0
                        
        !$OMP PARALLEL DO default (shared) private (i, percentToDo)
            do i = 1, 10
                localCount = localCount + 1
                percentToDo = (sin(req%Aii(1)) + cos(req%Aii(2))) + OMP_GET_THREAD_NUM()
                print *, i, percentToDo, OMP_GET_THREAD_NUM()
            end do
        !$OMP END PARALLEL DO
        print *, localCount    
        print *, OMP_IN_PARALLEL(), OMP_GET_MAX_THREADS()
                        
        localCount = 0
                        
        !$OMP PARALLEL default (shared) private (i, percentToDo)
        !$OMP DO
            do i = 1, 10
                localCount = localCount + 1
                percentToDo = (sin(req%Aii(1)) + cos(req%Aii(2))) + OMP_GET_THREAD_NUM()
                print *, i, percentToDo, OMP_GET_THREAD_NUM()
            end do
        !$OMP END DO
        !$OMP END PARALLEL 
                            
        print *, localCount
        print *, OMP_IN_PARALLEL(), OMP_GET_MAX_THREADS()

The first PARALLEL region works perfectly

 F           8
 hello from thread:           0
 hello from thread:           6
 hello from thread:           5
 hello from thread:           3
 hello from thread:           1
 hello from thread:           2
 hello from thread:           4
 hello from thread:           7
 F           8

The PARALLEL DO runs each iteration on thread 0. 

           1  0.862318873405457                0
           2  0.862318873405457                0
           3  0.862318873405457                0
           4  0.862318873405457                0
           5  0.862318873405457                0
           6  0.862318873405457                0
           7  0.862318873405457                0
           8  0.862318873405457                0
           9  0.862318873405457                0
          10  0.862318873405457                0
          10

The region where I put OMP PARALLEL followed by OMP DO executes the entire loop for every thread.

 F           8
           1  0.862318873405457                0
           1   5.86231899261475                5
           2   5.86231899261475                5
           3   5.86231899261475                5
           4   5.86231899261475                5
           1   6.86231899261475                6
           5   5.86231899261475                5
           1   3.86231899261475                3
           6   5.86231899261475                5

...

           7   1.86231887340546                1
           8   1.86231887340546                1
           9   1.86231887340546                1
          10   1.86231887340546                1
          80
 F           8

Any insight would be appreciated.

The compiler will attempt to optimize the PARALLEL DO. In this specific case, the compiler determined that the work performed within the parallel region was insufficient to warrant using more than one thread. For this specific loop, if you wish to force the iteration space to be partitioned one iteration at a time, add the clause SCHEDULE(static, 1):

!$OMP PARALLEL DO default (shared) private (i, percentToDo) schedule(static, 1) 
do i = 1, 10
  localCount = localCount + 1
  percentToDo = (sin(req%Aii(1)) + cos(req%Aii(2))) + OMP_GET_THREAD_NUM()
  print *, i, percentToDo, OMP_GET_THREAD_NUM()
end do
!$OMP END PARALLEL DO
 ... 
!$OMP PARALLEL default (shared) private (i, percentToDo)
!$OMP DO schedule(static, 1)
do i = 1, 10
  localCount = localCount + 1
  percentToDo = (sin(req%Aii(1)) + cos(req%Aii(2))) + OMP_GET_THREAD_NUM()
  print *, i, percentToDo, OMP_GET_THREAD_NUM()
end do
!$OMP END DO 
!$OMP END PARALLEL

Jim Dempsey

Thanks, Jim.

As far as I can tell, schedule(static, 1) seems to have no effect on either loop.

Conceptually, I thought that

        !$OMP PARALLEL DO default (shared) private (percentToDo)
            do i = 1, 10

and

        !$OMP PARALLEL default (shared) private (i, percentToDo) 
        !$OMP DO
            do i = 1, 10

were supposed to be functionally equivalent, so it is telling that the second version executes the entire loop for each thread (a total of 10 * 8 = 80 passes). It seems like it is ignoring the "!$OMP DO" completely.

This code snippet was simplified from a code that does much more work in the PARALLEL DO. It was at one point working, but it no longer is. The only substantial change I can think of is that I am using a newer version of the compiler (although not the latest); currently 17.0.8.275 (64-bit).

program foo
    use omp_lib
    implicit none
    type req_t
        real :: Aii(10)
    end type req_t
    type(req_t) :: req
    integer :: i, localCount
    real :: percentToDo, percentComplete
    print *, OMP_IN_PARALLEL(), OMP_GET_MAX_THREADS()
                
    !$OMP PARALLEL
        print *, 'hello from thread:', OMP_GET_THREAD_NUM()
    !$OMP END PARALLEL
                    
    print *, OMP_IN_PARALLEL(), OMP_GET_MAX_THREADS()
    percentComplete = 0.0      
    req%Aii(1) = percentComplete
    req%Aii(2) = 100. - percentComplete
                    
    localCount = 0
                
    !$OMP PARALLEL DO default (shared) private (i, percentToDo)
        do i = 1, 10
            localCount = localCount + 1
            percentToDo = (sin(req%Aii(1)) + cos(req%Aii(2))) + OMP_GET_THREAD_NUM()
            print *, i, percentToDo, OMP_GET_THREAD_NUM()
        end do
    !$OMP END PARALLEL DO
    print *, localCount    
    print *, OMP_IN_PARALLEL(), OMP_GET_MAX_THREADS()
                
    localCount = 0
                
    !$OMP PARALLEL default (shared) private (i, percentToDo)
    !$OMP DO
        do i = 1, 10
            localCount = localCount + 1
            percentToDo = (sin(req%Aii(1)) + cos(req%Aii(2))) + OMP_GET_THREAD_NUM()
            print *, i, percentToDo, OMP_GET_THREAD_NUM()
        end do
    !$OMP END DO
    !$OMP END PARALLEL 
                    
    print *, localCount
    print *, OMP_IN_PARALLEL(), OMP_GET_MAX_THREADS()
end program foo
============= output ============
 F           8
 hello from thread:           1
 hello from thread:           2
 hello from thread:           7
 hello from thread:           3
 hello from thread:           5
 hello from thread:           4
 hello from thread:           6
 hello from thread:           0
 F           8
           6   3.862319               3
           1  0.8623189               0
          10   7.862319               7
           5   2.862319               2
           3   1.862319               1
           7   4.862319               4
           8   5.862319               5
           4   1.862319               1
           9   6.862319               6
           2  0.8623189               0
          10
 F           8
           1  0.8623189               0
           8   5.862319               5
           9   6.862319               6
           7   4.862319               4
           5   2.862319               2
           2  0.8623189               0
           3   1.862319               1
           4   1.862319               1
           6   3.862319               3
          10   7.862319               7
           9
 F           8

And:

program foo
    use omp_lib
    implicit none
    type req_t
        real :: Aii(10)
    end type req_t
    type(req_t) :: req
    integer :: i, localCount
    real :: percentToDo, percentComplete
    print *, OMP_IN_PARALLEL(), OMP_GET_MAX_THREADS()
                
    !$OMP PARALLEL
        print *, 'hello from thread:', OMP_GET_THREAD_NUM()
    !$OMP END PARALLEL
                    
    print *, OMP_IN_PARALLEL(), OMP_GET_MAX_THREADS()
    percentComplete = 0.0      
    req%Aii(1) = percentComplete
    req%Aii(2) = 100. - percentComplete
                    
    localCount = 0
                
    !$OMP PARALLEL DO default (shared) private (i, percentToDo) schedule(static, 1)
        do i = 1, 10
            localCount = localCount + 1
            percentToDo = (sin(req%Aii(1)) + cos(req%Aii(2))) + OMP_GET_THREAD_NUM()
            print *, i, percentToDo, OMP_GET_THREAD_NUM()
        end do
    !$OMP END PARALLEL DO
    print *, localCount    
    print *, OMP_IN_PARALLEL(), OMP_GET_MAX_THREADS()
                
    localCount = 0
                
    !$OMP PARALLEL default (shared) private (i, percentToDo)
    !$OMP DO schedule(static, 1)
        do i = 1, 10
            localCount = localCount + 1
            percentToDo = (sin(req%Aii(1)) + cos(req%Aii(2))) + OMP_GET_THREAD_NUM()
            print *, i, percentToDo, OMP_GET_THREAD_NUM()
        end do
    !$OMP END DO
    !$OMP END PARALLEL 
                    
    print *, localCount
    print *, OMP_IN_PARALLEL(), OMP_GET_MAX_THREADS()
end program foo
====================== output ================
 F           8
 hello from thread:           1
 hello from thread:           0
 hello from thread:           6
 hello from thread:           2
 hello from thread:           7
 hello from thread:           3
 hello from thread:           4
 hello from thread:           5
 F           8
           7   6.862319               6
           2   1.862319               1
           6   5.862319               5
           3   2.862319               2
           4   3.862319               3
          10   1.862319               1
           1  0.8623189               0
           8   7.862319               7
           9  0.8623189               0
           5   4.862319               4
          10
 F           8
           1  0.8623189               0
           2   1.862319               1
           9  0.8623189               0
           4   3.862319               3
           7   6.862319               6
           5   4.862319               4
           6   5.862319               5
          10   1.862319               1
           8   7.862319               7
           3   2.862319               2
          10
 F           8

On my system the schedule(static, 1) was not required.

Intel® Parallel Studio XE 2020 Composer Edition for Fortran Windows*   Package ID: w_comp_lib_2020.0.166
Intel® Parallel Studio XE 2020 Composer Edition for Fortran Windows* Integration for Microsoft Visual Studio* 2019,
  Version 19.1.0055.16, Copyright © 2002-2019 Intel Corporation. All rights reserved.

Not sure what is going on on your end.

Copy and paste the sample code I have above (sans the output). See what happens.

Jim Dempsey

It is really bizarre. I did the same as you: I created a standalone console app and pasted my code in. It works exactly as you demonstrated.

My production code is actually a DLL that is called by another program (and it is also compiled with several static libraries). In my little standalone test, I also made a static library, compiled it into a DLL, and called it from a different code. It still works correctly.

I went line-by-line through the project settings for the test code and the production code and they are pretty much the same. I made the settings in the test code match those in the production code and the test code still works.

I also tried the following; although it is ugly, it seems to work. Instead of using OMP DO, I just made the region parallel and did this:

                        if (OMP_GET_THREAD_NUM() == omp_get_max_threads() - 1) then
                            iEnd = localCount
                        else
                            iEnd   = (OMP_GET_THREAD_NUM() + 1) * (localCount / omp_get_max_threads())
                        end if
                        
                        iStart = (OMP_GET_THREAD_NUM()    ) * (localCount / omp_get_max_threads()) + 1
  
                        ! main loop
                        do i = iStart, iEnd

The do loop used to be "do i = 1, localCount", so I'm just splitting the loop up manually for each thread.  Perhaps there is a way to do this more efficiently.
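[Editor's note] If the manual split is kept, one way to balance it a bit better is to spread the leftover iterations over the first few threads rather than giving them all to the last thread. A sketch, with hypothetical values and variable names chosen to resemble the snippet above:

program manual_split
    use omp_lib
    implicit none
    integer :: i, localCount, nThreads, myId, chunk, extra, iStart, iEnd
    localCount = 10

    !$OMP PARALLEL PRIVATE(i, nThreads, myId, chunk, extra, iStart, iEnd)
    nThreads = omp_get_num_threads()
    myId     = omp_get_thread_num()
    chunk    = localCount / nThreads           ! base chunk size
    extra    = mod(localCount, nThreads)       ! leftover iterations
    if (myId < extra) then                     ! first 'extra' threads take one more
        iStart = myId * (chunk + 1) + 1
        iEnd   = iStart + chunk
    else
        iStart = extra * (chunk + 1) + (myId - extra) * chunk + 1
        iEnd   = iStart + chunk - 1
    end if
    do i = iStart, iEnd
        print *, i, myId                       ! stand-in for the real loop body
    end do
    !$OMP END PARALLEL
end program manual_split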

>>My production code is actually a DLL that is called by another program

If the calling code is also using OpenMP .AND. calling your DLL from within a parallel region .AND. (default) nested parallelism is disabled, then any attempt at nesting parallel regions will result in 1 thread at the nested level(s).

omp_get_max_threads() does not necessarily return the number of threads used for a parallel region; omp_get_num_threads(), called inside the region, does.
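[Editor's note] For illustration, a minimal sketch of the difference between the two queries:

program threads_query
    use omp_lib
    implicit none
    ! Outside a parallel region omp_get_num_threads() returns 1;
    ! omp_get_max_threads() returns the size a new team would get.
    print *, 'outside: num =', omp_get_num_threads(), ' max =', omp_get_max_threads()

    !$OMP PARALLEL
    !$OMP SINGLE
    ! Inside the region omp_get_num_threads() reports the team actually created.
    print *, 'inside:  num =', omp_get_num_threads(), ' max =', omp_get_max_threads()
    !$OMP END SINGLE
    !$OMP END PARALLEL
end program threads_query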

If !$OMP PARALLEL produces a parallel region (with multiple threads) and !$OMP PARALLEL DO does not, see what the following reports:

allocate(threadRan(0:omp_get_max_threads()-1)) ! LOGICAL
threadRan = .false.
!$OMP PARALLEL
nThreadsParallel = omp_get_num_threads() ! integer
!$OMP DO ...
DO I=1,10
   threadRan(omp_get_thread_num()) = .true.
   ...
end DO
!$omp end do
!$omp end parallel
print *, nThreadsParallel
print *,threadRan
...

Jim Dempsey

In Quote #11, you could try the following changes, as localCount is shared (or use !$OMP atomic? a sketch of that variant follows the code below).

 !$OMP PARALLEL DO default (shared) private (i, id, percentToDo) reduction (+ : localCount)
    do i = 1, 10
        id = OMP_GET_THREAD_NUM()
        localCount = localCount + 1
        percentToDo = (sin(req%Aii(1)) + cos(req%Aii(2))) + id
        print *, i, percentToDo, id, OMP_GET_MAX_THREADS(), OMP_IN_PARALLEL()
    end do
 !$OMP END PARALLEL DO
    print *, localCount    
    print *, i, percentToDo, id, OMP_GET_MAX_THREADS(), OMP_IN_PARALLEL(),' is i=11 or 0 ?'
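[Editor's note] A sketch of the atomic alternative mentioned above; the loop is the same hypothetical one, with declarations as in the earlier reproducer:

 !$OMP PARALLEL DO default (shared) private (i, id, percentToDo)
    do i = 1, 10
        id = OMP_GET_THREAD_NUM()
        !$OMP ATOMIC
        localCount = localCount + 1          ! only this update is serialized
        percentToDo = (sin(req%Aii(1)) + cos(req%Aii(2))) + id
        print *, i, percentToDo, id
    end do
 !$OMP END PARALLEL DO
    print *, localCount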

 

As an update, I was not able to figure out why my production code will not execute the PARALLEL DO.  In answer to post #16, I do not have any nested parallel regions.  I only have a single parallel region.

I have sort of a workaround where I just create a parallel region with !$OMP PARALLEL and then, in each thread, I adjust the upper and lower bounds of the do loop based on the thread ID. So basically I am breaking the iterations into evenly sized chunks and executing each chunk on a different thread.

However, this does not give me nearly the performance that I have seen from this code in the past. I used to get performance gains up to about 8 processors (on an 8-core machine). Now I barely see any improvement past 2-4 threads. I suspect my manual split-up is just not balancing the load well; I probably have some threads finishing much earlier than others.

Here is where it gets more interesting. I have been using Intel Fortran 2017 Update 8. I first asked a colleague to compile this with version 2019 (~Update 2, I think). It did run the code in parallel, but it was much slower.

I downloaded and installed Intel Fortran 2020 (Update 1). The good news is that this compiler does multi-thread my PARALLEL DO loop. The bad news is that the code is MUCH slower than the 2017-compiled code. Using 8 cores, the code is about 6 times slower than the code compiled with Fortran 2017 (which is running single-threaded because of the aforementioned bug). Single-threaded with the new compiler, it is 11 times slower.

2017 compiled    1 thread     wall clock time = 5.5 seconds
2019 compiled    1 thread     wall clock time = 59 seconds
2019 compiled    2 threads    wall clock time = 41 seconds
2019 compiled    4 threads    wall clock time = 34 seconds
2019 compiled    8 threads    wall clock time = 32 seconds

With past versions of this code, this same problem would run in about 1 second.

I'm going to try to open a support ticket with Intel. Our company is moving to Visual Studio 2019, so it is going to be tough for me to stick with Fortran 2017, but we can't live with this kind of slowdown. This code was originally developed, I believe, with Intel 2010, and we have since gone through 2013, 2015, and 2017 without issues. I wonder what could have changed to make my code so much slower.

I looked through all of the project settings and I see nowhere I could set any kind of compatibility option or anything similar.

John Campbell wrote:

In Quote #11, you could try the following changes, as localCount is shared (or use !$OMP atomic?).

 !$OMP PARALLEL DO default (shared) private (i, id, percentToDo) reduction (+ : localCount)
    do i = 1, 10
        id = OMP_GET_THREAD_NUM()
        localCount = localCount + 1
        percentToDo = (sin(req%Aii(1)) + cos(req%Aii(2))) + id
        print *, i, percentToDo, id, OMP_GET_MAX_THREADS(), OMP_IN_PARALLEL()
    end do
 !$OMP END PARALLEL DO
    print *, localCount    
    print *, i, percentToDo, id, OMP_GET_MAX_THREADS(), OMP_IN_PARALLEL(),' is i=11 or 0 ?'

 

Thanks, John, but the value of that localCount variable is not really material. I just added it because I wanted to see how many times the loop was executing, and it behaves the same with or without that variable. The problem I am having is that the OMP DO seems to be ignored altogether: if I create a parallel region with OMP PARALLEL followed by OMP DO, each thread executes the whole loop (8 * 10 = 80 executions), and if I use PARALLEL DO, the entire loop executes on one thread.

Philliard,

You indicated in post #11 that the !$OMP region was remaining single-threaded. I have seen no indication of why this happened. If you print out the value of i after the DO, it will be 11 for a non-OMP run, or undefined if OMP was active. Warnings can be poor when !$OMP directives are ignored. I prefer explicit private and shared clauses, to at least review how arrays are being used and replicated.

You say that OMP is not working by comparing 2017 to 2019 performance, but the most significant change is for the non-OMP case (5.5 sec to 59 sec). What is the reason for this? You need an explanation (different test, bigger problem, reduced caching?). It is a big change.

The other issue is OMP efficiency (8 threads: 32 sec vs 1 thread: 59 sec); that is not good scaling. Make sure you are comparing the OMP region performance. Then consider the possible sources of inefficiency: count the number of OMP region entries (at roughly 2.e-5 sec per entry), check for memory clashes when updating shared arrays, and ask whether memory transfer demands have increased.
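[Editor's note] One way to time just the OMP region (a sketch; omp_get_wtime() is the OpenMP wall-clock timer):

program time_region
    use omp_lib
    implicit none
    double precision :: t0, t1
    t0 = omp_get_wtime()
    !$OMP PARALLEL
    ! ... the real parallel work goes here ...
    !$OMP END PARALLEL
    t1 = omp_get_wtime()
    print *, 'parallel region wall time (s):', t1 - t0
end program time_region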
