Intel® Fortran Compiler
Build applications that can scale for the future with optimized code designed for Intel® Xeon® and compatible processors.

OpenMP: Loops are not parallelized

Benedikt_R_
Beginner
3,917 Views

Hi 

I'm writing a Fortran program with OpenMP.

!$OMP PARALLEL
!$OMP DO PRIVATE(j)
       DO I=1,10
          j = aIndex(I)
          ...
          VALUES(j) = ...
       END DO
!$OMP END DO
!$OMP END PARALLEL

The compiler refuses to parallelize this:

OpenMP Construct at file.for(2255,7)
   remark #16201: OpenMP DEFINED REGION WAS PARALLELIZED
...

LOOP BEGIN at file.for(2258,7)
   remark #17104: loop was not parallelized: existence of parallel dependence
   remark #15300: LOOP WAS VECTORIZED
LOOP END

Actually I *do* know that aIndex contains only distinct indices. Therefore the loop *can* be parallelized.

Is there any way to override the compiler? In OpenACC, for example, I could write

!$acc loop independent private(j)
Thanks
Benedikt
10 Replies
jimdempseyatthecove
Honored Contributor III

Try this

!$OMP PARALLEL PRIVATE(i,j)
!$OMP DO
       DO I=1,10
          j = aIndex(I)
          ...
          VALUES(j) = ...
       END DO
!$OMP END DO
!$OMP END PARALLEL

You may have had an issue with j used outside the !$OMP DO

Jim Dempsey

pbkenned1
Employee

The issue could be as Jim suggested, but are you certain the loop did not parallelize?  Compile with -Qopenmp-report to make certain.  If it really doesn't parallelize, we need more context to understand why.  I added some minimal context and the loop parallelizes:

C:\ISN_Forums\U543478>cat file.f90
program U543478
implicit none
integer, parameter :: N = 10
integer, dimension(N) :: aIndex
integer, dimension(N) :: VALUES
integer i,j

aIndex = (/(I,I=1,N)/)

!$OMP PARALLEL
!$OMP DO PRIVATE(j)
       DO I=1,N
          j = aIndex(I)
          VALUES(j) = I
       END DO
!$OMP END DO
!$OMP END PARALLEL

print *,'VALUES(1), VALUES(N) =',VALUES(1),VALUES(N)

end program U543478
C:\ISN_Forums\U543478>ifort -Qopenmp file.f90 -Qopenmp-report -Qopt-report-file=stdout
Intel(R) Visual Fortran Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 15.0.2.179 Build 20150121
Copyright (C) 1985-2015 Intel Corporation.  All rights reserved.
ifort: command line remark #10010: option '/Qopenmp-report' is deprecated and will be removed in a future release. See '/help deprecated'


Begin optimization report for: U543478

    Report from: OpenMP optimizations [openmp]

OpenMP Construct at C:\ISN_Forums\U543478\file.f90(11,7)
   remark #16200: OpenMP DEFINED LOOP WAS PARALLELIZED
OpenMP Construct at C:\ISN_Forums\U543478\file.f90(10,7)
   remark #16201: OpenMP DEFINED REGION WAS PARALLELIZED
===========================================================================
Microsoft (R) Incremental Linker Version 12.00.21005.1
Copyright (C) Microsoft Corporation.  All rights reserved.

-out:file.exe
-subsystem:console
-defaultlib:libiomp5md.lib
-nodefaultlib:vcomp.lib
-nodefaultlib:vcompd.lib
file.obj

C:\ISN_Forums\U543478>file
 VALUES(1), VALUES(N) =           1          10

C:\ISN_Forums\U543478>

 

Patrick

Benedikt_R_
Beginner

Patrick! That's awesome! You are right.

The loop was parallelized. The optimizer report is just wrong (or strange, or confusing).

After adding /Qopenmp-report to the compiler switches, the report confirms

   remark #16200: OpenMP DEFINED LOOP WAS PARALLELIZED

Thank you very much

Benedikt.

Steven_L_Intel1
Employee

An actual example, rather than a pseudocode snippet, would be helpful in resolving your problem.

pbkenned1
Employee

>>>The loop was parallelized. There's just a wrong (strange?, confusing?) optimizer-report.

Being an old hack, I compiled with -Qopenmp-report out of habit.  But it's deprecated now, and /Qopt-report-phase:openmp is the suggested replacement. 

Unfortunately I can't seem to get 'OpenMP DEFINED LOOP WAS PARALLELIZED' using the suggested replacement, in combination with other -Qopt-report* compiler switches. 

I'll look into this more closely and file a problem report if needed.

Patrick

John_Campbell
New Contributor II

It is worth considering the overhead of using !$OMP structures.

It takes between 5 and 20 microseconds to initiate a !$OMP PARALLEL region, which is equivalent to between 10,000 and 50,000 processor cycles. If the savings from multi-threaded processing do not exceed this cost, there is no gain. In the example in quote #3 above, the original serial loop looks to be about 50 CPU cycles, so applying !$OMP will make the code run hundreds of times slower.

The recommendation for !$OMP usage is to parallelise the outer loop and vectorise the inner loop. I wonder whether ifort should provide recommendations on the likely gains (or lack thereof) where this can be assessed.
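As a minimal sketch of that recommendation (the array names, sizes, and loop body here are invented for illustration, and compilation with -Qopenmp is assumed), the !$OMP directive goes on the outer loop while the inner, stride-1 loop is left to the vectorizer:

```fortran
! Sketch: parallelize the outer loop, vectorize the inner one.
! A, B, M, N are hypothetical; substitute your own data.
program outer_parallel
  implicit none
  integer, parameter :: N = 1000, M = 1000
  real, allocatable :: A(:,:), B(:,:)
  integer :: i, j

  allocate(A(M,N), B(M,N))
  B = 1.0

!$OMP PARALLEL DO PRIVATE(j)
  do i = 1, N                 ! outer loop: one range of columns per thread
!$OMP SIMD
     do j = 1, M              ! inner loop: contiguous in memory (column-major),
        A(j,i) = 2.0 * B(j,i) ! so it vectorizes well
     end do
  end do
!$OMP END PARALLEL DO

  print *, 'A(1,1), A(M,N) =', A(1,1), A(M,N)
end program outer_parallel
```

Because Fortran arrays are column-major, keeping the first subscript in the inner loop gives the unit-stride access the vectorizer wants.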

John

Benedikt_R_
Beginner

So, my original question was answered, and John's comment is somewhat off topic ...
but I'm certainly very interested!

Does that mean I should use one "!$OMP PARALLEL" and put as many loops inside it as possible?

Benedikt

pbkenned1
Employee

>>>I wonder if ifort should provide recommendations on the likely gains (or lack of) where this can be assessed.

This wasn't a question about expected gains (or lack thereof) from OpenMP parallelization, but yes, a paltry 10 iterations is far, far below the threshold to expect any gains.  But OpenMP assumes the programmer knows best and tries to do whatever is asked, regardless of expected gains, existence of loop carried dependencies, incorrect data sharing attributes of variables, etc.  It's a sharp knife and you'll be badly cut if you mishandle it.

On the other hand, the auto-parallelizer and auto-vectorizer are somewhat more user-friendly, providing efficiency and dependency information, and as a performance-tuning BKM I've had good results testing the waters with -Qparallel before throwing OMP directives at the code.

Consider this non-OpenMP version of the original test case:

#ifdef BIGN
integer, parameter :: N = 10000
#else
integer, parameter :: N = 10
#endif

integer, dimension(N) :: aIndex
integer, dimension(N) :: VALUES
integer i,j

aIndex = (/(I,I=1,N)/)

!DIR$ PARALLEL PRIVATE(j)
       DO I=1,N
          j = aIndex(I)
          VALUES(j) = I
       END DO

*** Case for N == 10 **************

C:\ISN_Forums\U543478>ifort -Qparallel -Qopt-report  -Qopt-report-file:stdout -Qopt-report-phase:par file-no-omp.f90 -fpp
Intel(R) Visual Fortran Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 15.0.2.179 Build 20150121
Copyright (C) 1985-2015 Intel Corporation.  All rights reserved.


Begin optimization report for: U543478

    Report from: Auto-parallelization optimizations [par]


LOOP BEGIN at C:\ISN_Forums\U543478\file-no-omp.f90(14,1)
   remark #17108: loop was not parallelized: insufficient computational work
LOOP END

LOOP BEGIN at C:\ISN_Forums\U543478\file-no-omp.f90(14,1)
<Remainder>
LOOP END

LOOP BEGIN at C:\ISN_Forums\U543478\file-no-omp.f90(17,8)
   remark #17108: loop was not parallelized: insufficient computational work
LOOP END

 

*** Case for N == 10000 **************

C:\ISN_Forums\U543478>ifort -Qparallel -Qopt-report  -Qopt-report-file:stdout -Qopt-report-phase:par file-no-omp.f90 -fpp -DBIGN
Intel(R) Visual Fortran Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 15.0.2.179 Build 20150121
Copyright (C) 1985-2015 Intel Corporation.  All rights reserved.


Begin optimization report for: U543478

    Report from: Auto-parallelization optimizations [par]


LOOP BEGIN at C:\ISN_Forums\U543478\file-no-omp.f90(14,1)
   remark #17109: LOOP WAS AUTO-PARALLELIZED
LOOP END

LOOP BEGIN at C:\ISN_Forums\U543478\file-no-omp.f90(17,8)
   remark #17109: LOOP WAS AUTO-PARALLELIZED
LOOP END

 

Patrick

pbkenned1
Employee

>>> I can't seem to get 'OpenMP DEFINED LOOP WAS PARALLELIZED' using the suggested replacement

Reported to the developers, internal tracking ID DPD200368047.  I'll keep this thread updated with any news.

Patrick

jimdempseyatthecove
Honored Contributor III

Patrick, Benedikt,

In Patrick's response #9 above, although the reports indicate that the auto-parallelizer chose not to parallelize the simplified (N=10) loop but did parallelize the N=10000 version, this example does not indicate which choice was better. Unfortunately, Patrick's example does not list execution times for the first pass and for, say, some pass after the cache is conditioned (~3rd pass).

Getting parallelization is not in and of itself the object of the exercise. Rather, the object should be getting as fast and/or as efficient as possible.
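A minimal sketch of such a timing harness (reusing the loop body from the earlier test case; running several passes separates the one-time thread-pool and cold-cache overhead from the steady state, and -Qopenmp is assumed):

```fortran
! Sketch: time a parallel loop over several passes so the first-pass
! overhead (thread-pool creation, cold cache) is visible separately.
program time_passes
  use omp_lib
  implicit none
  integer, parameter :: N = 10000
  integer :: aIndex(N), VALUES(N)
  integer :: i, j, pass
  double precision :: t0, t1

  aIndex = (/(i, i = 1, N)/)

  do pass = 1, 3
     t0 = omp_get_wtime()
!$OMP PARALLEL DO PRIVATE(j)
     do i = 1, N
        j = aIndex(i)
        VALUES(j) = i
     end do
!$OMP END PARALLEL DO
     t1 = omp_get_wtime()
     print *, 'pass', pass, 'time (s):', t1 - t0
  end do

  print *, 'VALUES(N) =', VALUES(N)
end program time_passes
```

Expect the first pass to be noticeably slower than the third; comparing the steady-state time against a serial build shows whether the region is worth parallelizing at all.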

Here are some guidelines:

a) The very first !$OMP PARALLEL (anything) incurs the one-time application overhead of creating the OpenMP thread pool. You cannot avoid this overhead unless you decide not to parallelize your program. You can reduce it by selecting a smaller number of threads than the number of hardware threads on the system; however, this is a trade-off between the degree of parallelization and the overhead of creating the thread pool.

(the following assumes non-nested parallel regions)

b) Subsequent !$OMP PARALLEL (anything), when issued from the serial region of the application, does not re-instantiate the OpenMP thread pool, so that overhead is eliminated. However, there is the overhead of a system call to wake up threads in the pool that have timed out of their spin-wait since the prior parallel region. The more pool threads that have been suspended, the higher the overhead. The spin-wait time is tunable via KMP_BLOCKTIME (environment variable and/or runtime function); the default is typically ~200 ms.

c) !$OMP DO, within an OpenMP parallel region, generally has less overhead than !$OMP PARALLEL DO when there is more than one !$OMP DO within the region. Therefore, when you have two or more parallel loops that run one after another, with little or no code between them, it is best to have one !$OMP PARALLEL region containing the multiple !$OMP DO loops. However, if the code between the parallel loops performs I/O, you might reconsider using a single parallel region. Note too, be mindful that all threads execute the whole parallel region, which may require the use of other !$OMP directives: MASTER, SINGLE, BARRIER, ...
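A minimal sketch of guideline c) (the arrays and loop bodies are invented for illustration): one fork/join for the whole region instead of one per loop.

```fortran
! Sketch: one parallel region containing two worksharing loops,
! paying the fork/join overhead once instead of twice.
program two_loops
  implicit none
  integer, parameter :: N = 10000
  real :: A(N), B(N)
  integer :: i

!$OMP PARALLEL
!$OMP DO
  do i = 1, N
     A(i) = real(i)
  end do
!$OMP END DO     ! implicit barrier: all of A is written before B reads it

!$OMP DO
  do i = 1, N
     B(i) = 2.0 * A(i)
  end do
!$OMP END DO
!$OMP END PARALLEL

  print *, 'B(N) =', B(N)
end program two_loops
```

The implicit barrier at each END DO is what keeps the second loop correct here; for truly independent loops, NOWAIT can remove it.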

d) If (when) the multiple loops within the parallel region are completely independent (not dependent on results of earlier loops), then consider hand-partitioning the loops based on the number of threads, the thread number, and the loop count. An alternative is to configure some of the threads to do each loop concurrently.
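A sketch of the hand-partitioning variant (again with invented arrays): each thread computes its own index range from its thread number, so both independent loops proceed with no worksharing barriers between them.

```fortran
! Sketch: hand-partition two independent loops across the thread team.
program hand_partition
  use omp_lib
  implicit none
  integer, parameter :: N = 10000
  real :: A(N), B(N)
  integer :: i, tid, nthr, chunk, lo, hi

!$OMP PARALLEL PRIVATE(i, tid, nthr, chunk, lo, hi)
  tid   = omp_get_thread_num()
  nthr  = omp_get_num_threads()
  chunk = (N + nthr - 1) / nthr       ! ceiling division
  lo    = tid * chunk + 1
  hi    = min(N, lo + chunk - 1)      ! empty range if more threads than chunks

  do i = lo, hi                       ! first independent loop
     A(i) = real(i)
  end do
  do i = lo, hi                       ! second independent loop, no barrier before it
     B(i) = real(2 * i)
  end do
!$OMP END PARALLEL

  print *, 'A(N), B(N) =', A(N), B(N)
end program hand_partition
```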

e) After you get through d) and need more utilization, there are additional options available to you. Master a), b), c), and d) first.

Jim Dempsey
