
OpenMP with PARALLEL DO not correctly working

Hello: I am using ifort (IFORT) 17.0.4 20170411 and am trying to parallelize a DO loop using OpenMP. I compile with -qopenmp. The code is large, and I suspect I could easily overlook important parts if I tried to write a minimal example.

When I use !$OMP PARALLELDO PRIVATE (...) ... !$OMP END PARALLELDO, only one thread is created and it runs the whole loop.

When I use !$OMP PARALLEL ... !$OMP END PARALLEL, and inside it !$OMP DO ... !$OMP END DO, all 20 threads are created and all of them are working, but each thread runs the whole loop; that is, the loop counter takes, for each thread, all of its values, so the complete loop is executed 20 times, once per thread. It seems the !$OMP DO is being completely ignored, yet there are no warnings from the compiler.

I use allocatable arrays that are allocated before the parallelized loop and are listed in the PRIVATE clause. Could that have something to do with it?

Here is some output for the case in which I use !$OMP PARALLEL ... !$OMP END PARALLEL with !$OMP DO ... !$OMP END DO inside it (the loop counter is "Problem_day"):

Number of requested cores: 20
Thread_000 Problem_day_0000001
Thread_001 Problem_day_0000001
Thread_003 Problem_day_0000001

When I omit the !$OMP DO ... !$OMP END DO and partition the loop myself, the work is distributed among the threads and there seems to be no problem.
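[Editor's note] For reference, a minimal sketch of the two forms being compared; the loop bounds, loop body, and names here are hypothetical, not the poster's actual code:

program two_forms
    use omp_lib
    implicit none
    integer :: Problem_day, NDays
    NDays = 365

    ! Form 1: combined directive; the loop index is implicitly private
    !$OMP PARALLEL DO
    do Problem_day = 1, NDays
        print *, 'Thread', OMP_GET_THREAD_NUM(), ' Problem_day', Problem_day
    end do
    !$OMP END PARALLEL DO

    ! Form 2: explicit parallel region with a work-sharing DO inside it
    !$OMP PARALLEL
    !$OMP DO
    do Problem_day = 1, NDays
        print *, 'Thread', OMP_GET_THREAD_NUM(), ' Problem_day', Problem_day
    end do
    !$OMP END DO
    !$OMP END PARALLEL
end program two_forms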

>>When I use !$OMP PARALLELDO PRIVATE (...) ... !$OMP END PARALLELDO, only one thread is created and it runs the whole loop.

Prior to !$OMP... are you already in a parallel region?

The behavior is as if you are, or have set the number of threads to 1.

Just prior to  !$OMP PARALLELDO PRIVATE (...) insert

PRINT *, OMP_IN_PARALLEL(), OMP_GET_MAX_THREADS()

You should see .FALSE. and 20

Jim Dempsey

Thanks for your answer, Jim. I have tested your suggestion and the result is as it should be:

F 20

And the whole loop is still being run 20 times, once per thread.

Can you copy and paste (using the {...} code button) the exact text of the !$OMP statements, inclusive of the Fortran DO statement?

Also, post your file name (in other words, is it xxx.f90, xxx.f, xxx.for, etc.), i.e. is the source compiled as fixed form or free form?

Jim Dempsey

Yes, Jim: The extension is .f08 (yes, I know it is not a good practice) and I compile with the options -free and -c -Tf file_name.f08. Regarding the exact OMP directives, here they are:

!$OMP PARALLEL PRIVATE (iProb, IdThread, NThreads, First, Last, Distances, Order, cObs, iStn, &
!$OMP                   iNonNanAnal, iAnal, OrderNonNan, &
!$OMP                   TandVector, i, TorsMatrix, &
!$OMP                   PredtandDtrSlope, PredtandDtrIntercept, &
!$OMP                   EstimationJulDay, &
!$OMP                   mi, ccm, kvars, corpar, coe, con, &
!$OMP                   Diff, NewDiff, intBestAnal)
!$OMP DO
...
!$OMP END DO
!$OMP END PARALLEL

I have read the following post, where they say that it is not enough to declare an allocatable array as private; I would also have to declare it as copyin. But the results of the parallelized code are the same as those of the non-parallelized code, so I think the allocatable private arrays are working well. The post is: https://software.intel.com/es-es/forums/intel-fortran-compiler-for-linux-and-mac-os-x/topic/371083
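[Editor's note] For what it is worth, COPYIN applies to THREADPRIVATE data, and a PRIVATE clause never copies values in (FIRSTPRIVATE does). One pattern that sidesteps the question is to allocate the work arrays inside the parallel region, so each thread manages its own copy. A minimal sketch with hypothetical names and sizes:

program private_workspace
    implicit none
    real, allocatable :: Distances(:)      ! hypothetical work array
    integer :: iProb

    !$OMP PARALLEL PRIVATE(iProb, Distances)
    allocate(Distances(1000))              ! each thread allocates its own copy
    !$OMP DO
    do iProb = 1, 20
        Distances = real(iProb)            ! per-thread work on the private array
    end do
    !$OMP END DO
    deallocate(Distances)
    !$OMP END PARALLEL
end program private_workspace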

Add private integer variables: myFirst, myLast

...
myFirst = 0
!$omp do
do i=... ! use other index if not i
  if(myFirst==0) myFirst=i !use other index if not i
  myLast=i
  ... ! remainder of loop
end do
!$omp end do
print *, omp_get_thread_nim(), omp_in_parallel(), myFirst, myLast
!$omp end parallel

Jim Dempsey

This is the output (I think you meant omp_get_thread_num). I use 20 threads:

10 T 1 1
15 T 1 1
6 T 1 1
9 T 1 1
7 T 1 1
1 T 1 1
8 T 1 1
16 T 1 1
11 T 1 1
14 T 1 1
12 T 1 1
19 T 1 1
2 T 1 1
13 T 1 1
5 T 1 1
3 T 1 1
0 T 1 1
17 T 1 1
4 T 1 1
18 T 1 1
7 T 1 2
16 T 1 2
19 T 1 2
11 T 1 2
10 T 1 2
6 T 1 2
2 T 1 2
14 T 1 2
17 T 1 2
4 T 1 2
0 T 1 2
3 T 1 2
12 T 1 2
5 T 1 2
1 T 1 2
13 T 1 2
9 T 1 2
8 T 1 2
15 T 1 2
18 T 1 2
7 T 1 3
16 T 1 3
11 T 1 3
6 T 1 3
10 T 1 3
2 T 1 3
14 T 1 3
19 T 1 3
17 T 1 3
4 T 1 3
3 T 1 3
5 T 1 3
0 T 1 3
12 T 1 3

I now have more information. When I compile with these options:

ifort -module intOz/obj/Release/ -fpp -check bounds -free -assume norealloc_lhs -qopenmp -xCORE-AVX2 -O2 -traceback -I../cdi/installed-intel-ozonosfera-1.6.2/include -I../FLAP/compiled-intel-ozonosfera/static/mod -c -Tf clasif_kmedias.f90 -o intOz/obj/Release/clasif-kemedias/clasif_kmedias.o

the "!$OMP parallel" works fine. But when I use the following:

ifort -module intOz/obj/Debug/ -fpp -check all -free -assume norealloc_lhs -qopenmp -traceback -warn all -debug full -check noarg_temp_created -heap-arrays -I../cdi/installed-intel-ozonosfera-1.6.2/include -I../FLAP/compiled-intel-ozonosfera/static/mod -c -Tf clasif_kmedias.f90 -o intOz/obj/Debug/clasif-kemedias/clasif_kmedias.o

the "!$OMP parallel" does not work.

>>10 T 1 1

What is the iteration count of your DO loop?

Is it less than the number of threads?

It appears that you have an outer loop (not shown) that sets the iteration count of the inner (parallel) loop to 1, then 2, then 3, ...

It should be OK, though inefficient, to have an iteration count less than the thread count.

Can you make, and post, a simple reproducer?

It is odd that the release version works but the debug version does not. Usually it is the other way around.

ifort -module intOz/obj/Release/ -fpp -check bounds -free -assume norealloc_lhs -qopenmp -xCORE-AVX2 -O2 -traceback                                                              -I../cdi/installed-intel-ozonosfera-1.6.2/include -I../FLAP/compiled-intel-ozonosfera/static/mod -c -Tf clasif_kmedias.f90 -o intOz/obj/Release/clasif-kemedias/clasif_kmedias.o
ifort -module intOz/obj/Debug/   -fpp -check all    -free -assume norealloc_lhs -qopenmp                 -traceback -warn all -debug full -check noarg_temp_created -heap-arrays -I../cdi/installed-intel-ozonosfera-1.6.2/include -I../FLAP/compiled-intel-ozonosfera/static/mod -c -Tf clasif_kmedias.f90 -o intOz/obj/Debug/clasif-kemedias/clasif_kmedias.o

In trying to isolate the issue (so that you can work around the problem), try changing the Debug build one option at a time to match the Release build; you may find the offending option.

A simple reproducer can be posted here and submitted to Intel.

Jim Dempsey

Was this ever resolved? I have a very similar problem with a code that I used to have running in parallel, but something has changed and I cannot figure out why it doesn't work any more. I simplified it down to this:

        print *, OMP_IN_PARALLEL(), OMP_GET_MAX_THREADS()
                        
        !$OMP PARALLEL
            print *, 'hello from thread:', OMP_GET_THREAD_NUM()
        !$OMP END PARALLEL
                            
        print *, OMP_IN_PARALLEL(), OMP_GET_MAX_THREADS()
                        
        req%Aii(1) = percentComplete
        req%Aii(2) = 100. - percentComplete
                            
        localCount = 0
                        
        !$OMP PARALLEL DO default (shared) private (i, percentToDo)
            do i = 1, 10
                localCount = localCount + 1
                percentToDo = (sin(req%Aii(1)) + cos(req%Aii(2))) + OMP_GET_THREAD_NUM()
                print *, i, percentToDo, OMP_GET_THREAD_NUM()
            end do
        !$OMP END PARALLEL DO
        print *, localCount    
        print *, OMP_IN_PARALLEL(), OMP_GET_MAX_THREADS()
                        
        localCount = 0
                        
        !$OMP PARALLEL default (shared) private (i, percentToDo)
        !$OMP DO
            do i = 1, 10
                localCount = localCount + 1
                percentToDo = (sin(req%Aii(1)) + cos(req%Aii(2))) + OMP_GET_THREAD_NUM()
                print *, i, percentToDo, OMP_GET_THREAD_NUM()
            end do
        !$OMP END DO
        !$OMP END PARALLEL 
                            
        print *, localCount
        print *, OMP_IN_PARALLEL(), OMP_GET_MAX_THREADS()

The first PARALLEL region works perfectly

 F           8
 hello from thread:           0
 hello from thread:           6
 hello from thread:           5
 hello from thread:           3
 hello from thread:           1
 hello from thread:           2
 hello from thread:           4
 hello from thread:           7
 F           8

The PARALLEL DO runs each iteration on thread 0. 

           1  0.862318873405457                0
           2  0.862318873405457                0
           3  0.862318873405457                0
           4  0.862318873405457                0
           5  0.862318873405457                0
           6  0.862318873405457                0
           7  0.862318873405457                0
           8  0.862318873405457                0
           9  0.862318873405457                0
          10  0.862318873405457                0
          10

The region where I put OMP PARALLEL followed by OMP DO executes the entire loop for every thread.

 F           8
           1  0.862318873405457                0
           1   5.86231899261475                5
           2   5.86231899261475                5
           3   5.86231899261475                5
           4   5.86231899261475                5
           1   6.86231899261475                6
           5   5.86231899261475                5
           1   3.86231899261475                3
           6   5.86231899261475                5

...

           7   1.86231887340546                1
           8   1.86231887340546                1
           9   1.86231887340546                1
          10   1.86231887340546                1
          80
 F           8

Any insight would be appreciated.

The compiler will attempt to optimize the PARALLEL DO. In this specific case, the compiler determined that the work performed within the parallel region was insufficient to warrant using more than one thread. For this specific loop, if you wish to force the iteration space to be partitioned one iteration at a time, add the clause SCHEDULE(static, 1):

!$OMP PARALLEL DO default (shared) private (i, percentToDo) schedule(static, 1) 
do i = 1, 10
  localCount = localCount + 1
  percentToDo = (sin(req%Aii(1)) + cos(req%Aii(2))) + OMP_GET_THREAD_NUM()
  print *, i, percentToDo, OMP_GET_THREAD_NUM()
end do
!$OMP END PARALLEL DO
 ... 
!$OMP PARALLEL default (shared) private (i, percentToDo)
!$OMP DO schedule(static, 1)
do i = 1, 10
  localCount = localCount + 1
  percentToDo = (sin(req%Aii(1)) + cos(req%Aii(2))) + OMP_GET_THREAD_NUM()
  print *, i, percentToDo, OMP_GET_THREAD_NUM()
end do
!$OMP END DO 
!$OMP END PARALLEL

Jim Dempsey

Thanks, Jim.

As far as I can tell, schedule(static, 1) seems to have no effect on either loop.

Conceptually, I thought that

        !$OMP PARALLEL DO default (shared) private (percentToDo)
            do i = 1, 10

and

        !$OMP PARALLEL default (shared) private (i, percentToDo) 
        !$OMP DO
            do i = 1, 10

were supposed to be functionally equivalent, so it is telling that the second version executes the entire loop for each thread (a total of 10 * 8 = 80 passes). It seems like it is ignoring the "!$OMP DO" completely.

This code snippet was simplified from a code that does much more work in the PARALLEL DO. It was at one point working, but it no longer is. The only substantial change I can think of is that I am using a newer version of the compiler (although not the latest); currently 17.0.8.275 (64-bit).

program foo
    use omp_lib
    implicit none
    type req_t
        real :: Aii(10)
    end type req_t
    type(req_t) :: req
    integer :: i, localCount
    real :: percentToDo, percentComplete
    print *, OMP_IN_PARALLEL(), OMP_GET_MAX_THREADS()
                
    !$OMP PARALLEL
        print *, 'hello from thread:', OMP_GET_THREAD_NUM()
    !$OMP END PARALLEL
                    
    print *, OMP_IN_PARALLEL(), OMP_GET_MAX_THREADS()
    percentComplete = 0.0      
    req%Aii(1) = percentComplete
    req%Aii(2) = 100. - percentComplete
                    
    localCount = 0
                
    !$OMP PARALLEL DO default (shared) private (i, percentToDo)
        do i = 1, 10
            localCount = localCount + 1
            percentToDo = (sin(req%Aii(1)) + cos(req%Aii(2))) + OMP_GET_THREAD_NUM()
            print *, i, percentToDo, OMP_GET_THREAD_NUM()
        end do
    !$OMP END PARALLEL DO
    print *, localCount    
    print *, OMP_IN_PARALLEL(), OMP_GET_MAX_THREADS()
                
    localCount = 0
                
    !$OMP PARALLEL default (shared) private (i, percentToDo)
    !$OMP DO
        do i = 1, 10
            localCount = localCount + 1
            percentToDo = (sin(req%Aii(1)) + cos(req%Aii(2))) + OMP_GET_THREAD_NUM()
            print *, i, percentToDo, OMP_GET_THREAD_NUM()
        end do
    !$OMP END DO
    !$OMP END PARALLEL 
                    
    print *, localCount
    print *, OMP_IN_PARALLEL(), OMP_GET_MAX_THREADS()
end program foo
============= output ============
 F           8
 hello from thread:           1
 hello from thread:           2
 hello from thread:           7
 hello from thread:           3
 hello from thread:           5
 hello from thread:           4
 hello from thread:           6
 hello from thread:           0
 F           8
           6   3.862319               3
           1  0.8623189               0
          10   7.862319               7
           5   2.862319               2
           3   1.862319               1
           7   4.862319               4
           8   5.862319               5
           4   1.862319               1
           9   6.862319               6
           2  0.8623189               0
          10
 F           8
           1  0.8623189               0
           8   5.862319               5
           9   6.862319               6
           7   4.862319               4
           5   2.862319               2
           2  0.8623189               0
           3   1.862319               1
           4   1.862319               1
           6   3.862319               3
          10   7.862319               7
           9
 F           8

And:

program foo
    use omp_lib
    implicit none
    type req_t
        real :: Aii(10)
    end type req_t
    type(req_t) :: req
    integer :: i, localCount
    real :: percentToDo, percentComplete
    print *, OMP_IN_PARALLEL(), OMP_GET_MAX_THREADS()
                
    !$OMP PARALLEL
        print *, 'hello from thread:', OMP_GET_THREAD_NUM()
    !$OMP END PARALLEL
                    
    print *, OMP_IN_PARALLEL(), OMP_GET_MAX_THREADS()
    percentComplete = 0.0      
    req%Aii(1) = percentComplete
    req%Aii(2) = 100. - percentComplete
                    
    localCount = 0
                
    !$OMP PARALLEL DO default (shared) private (i, percentToDo) schedule(static, 1)
        do i = 1, 10
            localCount = localCount + 1
            percentToDo = (sin(req%Aii(1)) + cos(req%Aii(2))) + OMP_GET_THREAD_NUM()
            print *, i, percentToDo, OMP_GET_THREAD_NUM()
        end do
    !$OMP END PARALLEL DO
    print *, localCount    
    print *, OMP_IN_PARALLEL(), OMP_GET_MAX_THREADS()
                
    localCount = 0
                
    !$OMP PARALLEL default (shared) private (i, percentToDo)
    !$OMP DO schedule(static, 1)
        do i = 1, 10
            localCount = localCount + 1
            percentToDo = (sin(req%Aii(1)) + cos(req%Aii(2))) + OMP_GET_THREAD_NUM()
            print *, i, percentToDo, OMP_GET_THREAD_NUM()
        end do
    !$OMP END DO
    !$OMP END PARALLEL 
                    
    print *, localCount
    print *, OMP_IN_PARALLEL(), OMP_GET_MAX_THREADS()
end program foo
====================== output ================
 F           8
 hello from thread:           1
 hello from thread:           0
 hello from thread:           6
 hello from thread:           2
 hello from thread:           7
 hello from thread:           3
 hello from thread:           4
 hello from thread:           5
 F           8
           7   6.862319               6
           2   1.862319               1
           6   5.862319               5
           3   2.862319               2
           4   3.862319               3
          10   1.862319               1
           1  0.8623189               0
           8   7.862319               7
           9  0.8623189               0
           5   4.862319               4
          10
 F           8
           1  0.8623189               0
           2   1.862319               1
           9  0.8623189               0
           4   3.862319               3
           7   6.862319               6
           5   4.862319               4
           6   5.862319               5
          10   1.862319               1
           8   7.862319               7
           3   2.862319               2
          10
 F           8

On my system the schedule(static, 1) was not required.

Intel® Parallel Studio XE 2020 Composer Edition for Fortran Windows*   Package ID: w_comp_lib_2020.0.166
Intel® Parallel Studio XE 2020 Composer Edition for Fortran Windows* Integration for Microsoft Visual Studio* 2019,
  Version 19.1.0055.16, Copyright © 2002-2019 Intel Corporation. All rights reserved.

Not sure what is going on on your end.

Copy and paste the sample code I have above (sans the output). See what happens.

Jim Dempsey

It is really bizarre. I did the same as you: I created a standalone console app and pasted my code in. It works exactly as you demonstrated.

My production code is actually a DLL that is called by another program (and it is also compiled with several static libraries). In my little standalone test, I also made a static library, compiled it into a DLL, and called it from a different code. It still works correctly.

I went line-by-line through the project settings for the test code and the production code and they are pretty much the same. I made the settings in the test code match those in the production code and the test code still works.

I also tried the following; although it is ugly, it seems to work. Instead of using OMP DO, I just made the region parallel and did this:

                        if (OMP_GET_THREAD_NUM() == omp_get_max_threads() - 1) then
                            iEnd = localCount
                        else
                            iEnd   = (OMP_GET_THREAD_NUM() + 1) * (localCount / omp_get_max_threads())
                        end if
                        
                        iStart = (OMP_GET_THREAD_NUM()    ) * (localCount / omp_get_max_threads()) + 1
  
                        ! main loop
                        do i = iStart, iEnd

The do loop used to be "do i = 1, localCount", so I'm just splitting the loop up manually for each thread.  Perhaps there is a way to do this more efficiently.
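[Editor's note] If the manual split is kept, one way to balance it a bit better is to spread the leftover iterations over the first few threads rather than giving them all to the last thread. A sketch, with hypothetical values and variable names chosen to resemble the snippet above:

program manual_split
    use omp_lib
    implicit none
    integer :: i, localCount, nThreads, myId, chunk, extra, iStart, iEnd
    localCount = 10

    !$OMP PARALLEL PRIVATE(i, nThreads, myId, chunk, extra, iStart, iEnd)
    nThreads = omp_get_num_threads()
    myId     = omp_get_thread_num()
    chunk    = localCount / nThreads           ! base chunk size
    extra    = mod(localCount, nThreads)       ! leftover iterations
    if (myId < extra) then                     ! first 'extra' threads take one more
        iStart = myId * (chunk + 1) + 1
        iEnd   = iStart + chunk
    else
        iStart = extra * (chunk + 1) + (myId - extra) * chunk + 1
        iEnd   = iStart + chunk - 1
    end if
    do i = iStart, iEnd
        print *, i, myId                       ! stand-in for the real loop body
    end do
    !$OMP END PARALLEL
end program manual_split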

>>My production code is actually a DLL that is called by another program

If the calling code is also using OpenMP .AND. calling your DLL from within a parallel region .AND. (default) nested parallelism is disabled, then any attempt at nesting parallel regions will result in 1 thread at the nested level(s).

omp_get_max_threads() does not necessarily return the number of threads used for a parallel region; omp_get_num_threads(), called inside the region, does.
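[Editor's note] For illustration, a minimal sketch of the difference between the two queries:

program threads_query
    use omp_lib
    implicit none
    ! Outside a parallel region omp_get_num_threads() returns 1;
    ! omp_get_max_threads() returns the size a new team would get.
    print *, 'outside: num =', omp_get_num_threads(), ' max =', omp_get_max_threads()

    !$OMP PARALLEL
    !$OMP SINGLE
    ! Inside the region omp_get_num_threads() reports the team actually created.
    print *, 'inside:  num =', omp_get_num_threads(), ' max =', omp_get_max_threads()
    !$OMP END SINGLE
    !$OMP END PARALLEL
end program threads_query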

If !$OMP PARALLEL produces a parallel region (with multiple threads) and !$OMP PARALLEL DO does not, see what the following reports:

allocate(threadRan(0:omp_get_max_threads()-1)) ! LOGICAL
threadRan = .false.
!$OMP PARALLEL
nThreadsParallel = omp_get_num_threads() ! integer
!$OMP DO ...
DO I=1,10
   threadRan(omp_get_thread_num()) = .true.
   ...
end DO
!$omp end do
!$omp end parallel
print *, nThreadsParallel
print *,threadRan
...

Jim Dempsey

In Quote #11, you could try the following changes, as localCount is shared (or use !$OMP atomic? a sketch of that variant follows the code below).

 !$OMP PARALLEL DO default (shared) private (i, id, percentToDo) reduction (+ : localCount)
    do i = 1, 10
        id = OMP_GET_THREAD_NUM()
        localCount = localCount + 1
        percentToDo = (sin(req%Aii(1)) + cos(req%Aii(2))) + id
        print *, i, percentToDo, id, OMP_GET_MAX_THREADS(), OMP_IN_PARALLEL()
    end do
 !$OMP END PARALLEL DO
    print *, localCount    
    print *, i, percentToDo, id, OMP_GET_MAX_THREADS(), OMP_IN_PARALLEL(),' is i=11 or 0 ?'
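[Editor's note] A sketch of the atomic alternative mentioned above; the loop is the same hypothetical one, with declarations as in the earlier reproducer:

 !$OMP PARALLEL DO default (shared) private (i, id, percentToDo)
    do i = 1, 10
        id = OMP_GET_THREAD_NUM()
        !$OMP ATOMIC
        localCount = localCount + 1          ! only this update is serialized
        percentToDo = (sin(req%Aii(1)) + cos(req%Aii(2))) + id
        print *, i, percentToDo, id
    end do
 !$OMP END PARALLEL DO
    print *, localCount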

 

As an update, I was not able to figure out why my production code will not execute the PARALLEL DO.  In answer to post #16, I do not have any nested parallel regions.  I only have a single parallel region.

I have sort of a workaround where I just create a parallel region with !$OMP PARALLEL and then, in each thread, I adjust the upper and lower bounds of the do loop based on the thread ID. So basically I am breaking the iterations into evenly sized chunks and executing each chunk on a different thread.

However, this does not give me nearly the performance that I have seen from this code in the past. I used to get performance gains up to about 8 processors (on an 8-core machine). Now I barely see any improvement past 2-4 threads. I suspect my manual split-up is just not balancing the load well; I probably have some threads finishing much earlier than others.

Here is where it gets more interesting. I have been using Intel Fortran 2017 Update 8. I first asked a colleague to compile this with version 2019 (~Update 2, I think). It did run the code in parallel, but it was much slower.

I downloaded and installed Intel Fortran 2020 (Update 1). The good news is that this compiler does multi-thread my PARALLEL DO loop. The bad news is that the code is MUCH slower than the 2017-compiled code. Using 8 cores, the code is about 6 times slower than the code compiled with Fortran 2017 (which is running single-threaded because of the aforementioned bug). Single-threaded with the new compiler, it is 11 times slower.

2017 compiled    1 thread     wall clock time = 5.5 seconds
2019 compiled    1 thread     wall clock time = 59 seconds
2019 compiled    2 threads    wall clock time = 41 seconds
2019 compiled    4 threads    wall clock time = 34 seconds
2019 compiled    8 threads    wall clock time = 32 seconds

With past versions of this code, this same problem would run in about 1 second.

I'm going to try to open a support ticket with Intel. Our company is moving to Visual Studio 2019, so it is going to be tough for me to stick with Fortran 2017, but we can't live with this kind of slowdown. This code was originally developed, I believe, with Intel 2010, and we have since gone through 2013, 2015, and 2017 without issues. I wonder what could have changed to make my code so much slower.

I looked through all of the project settings and I see nowhere I could set any kind of compatibility option or anything similar.

John Campbell wrote:

In Quote #11, you could try the following changes, as localCount is shared (or use !$OMP atomic?).

 !$OMP PARALLEL DO default (shared) private (i, id, percentToDo) reduction (+ : localCount)
    do i = 1, 10
        id = OMP_GET_THREAD_NUM()
        localCount = localCount + 1
        percentToDo = (sin(req%Aii(1)) + cos(req%Aii(2))) + id
        print *, i, percentToDo, id, OMP_GET_MAX_THREADS(), OMP_IN_PARALLEL()
    end do
 !$OMP END PARALLEL DO
    print *, localCount    
    print *, i, percentToDo, id, OMP_GET_MAX_THREADS(), OMP_IN_PARALLEL(),' is i=11 or 0 ?'

 

Thanks, John, but the value of that localCount variable is not really material. I just added it because I wanted to see how many times the loop was executing, and it behaves the same with or without that variable. The problem I am having is that the OMP DO seems to be ignored altogether: if I create a parallel region with OMP PARALLEL followed by OMP DO, each thread executes the whole loop (8 * 10 = 80 executions), and if I use PARALLEL DO, the entire loop executes on one thread.

Philliard,

You indicated in post #11 that the !$OMP region was remaining single-threaded. I have seen no indication of why this happened. If you print out the value of i after the DO, it will be 11 for a non-OMP run, or undefined if OMP was active. Warnings can be poor when !$OMP directives are ignored. I prefer explicit private and shared clauses, to at least review how arrays are being used and replicated.

You say that OMP is not working by comparing 2017 to 2019 performance, but the most significant change is for the non-OMP case (5.5 sec to 59 sec). What is the reason for this? You need an explanation (different test, bigger problem, reduced caching?). It is a big change.

The other issue is OMP efficiency (8 threads: 32 sec vs 1 thread: 59 sec); that is not good scaling. Make sure you are comparing the OMP region performance. Then consider the possible sources of inefficiency: count the number of OMP region entries (at roughly 2.e-5 sec per entry), check for memory clashes when updating shared arrays, and ask whether memory transfer demands have increased.
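[Editor's note] One way to time just the OMP region (a sketch; omp_get_wtime() is the OpenMP wall-clock timer):

program time_region
    use omp_lib
    implicit none
    double precision :: t0, t1
    t0 = omp_get_wtime()
    !$OMP PARALLEL
    ! ... the real parallel work goes here ...
    !$OMP END PARALLEL
    t1 = omp_get_wtime()
    print *, 'parallel region wall time (s):', t1 - t0
end program time_region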
