Hi,
I'm writing a Fortran program with OpenMP.
!$OMP PARALLEL
!$OMP DO PRIVATE(j)
      DO I=1,10
        j = aIndex(I)
        ...
        VALUES(j) = ...
      END DO
!$OMP END DO
!$OMP END PARALLEL
The compiler refuses to parallelize this:
OpenMP Construct at file.for(2255,7)
remark #16201: OpenMP DEFINED REGION WAS PARALLELIZED
...
LOOP BEGIN at file.for(2258,7)
remark #17104: loop was not parallelized: existence of parallel dependence
remark #15300: LOOP WAS VECTORIZED
LOOP END
Actually I *do* know that aIndex contains only distinct indices. Therefore the loop *can* be parallelized.
Is there any way to overrule the compiler? In OpenACC, for example, I could write
!$acc loop independent private(j)
Try this
!$OMP PARALLEL PRIVATE(i,j)
!$OMP DO
      DO I=1,10
        j = aIndex(I)
        ...
        VALUES(j) = ...
      END DO
!$OMP END DO
!$OMP END PARALLEL
You may have had an issue with j being used outside the !$OMP DO.
Jim Dempsey
The issue could be as Jim suggested, but are you certain the loop did not parallelize? Compile with -Qopenmp-report to make certain. If you then see that it doesn't parallelize, we'll need more context to understand why. I added some minimal context and the loop parallelizes:
C:\ISN_Forums\U543478>cat file.f90
program U543478
      implicit none
      integer, parameter :: N = 10
      integer, dimension(N) :: aIndex
      integer, dimension(N) :: VALUES
      integer i,j
      aIndex = (/(I,I=1,N)/)
      !$OMP PARALLEL
      !$OMP DO PRIVATE(j)
      DO I=1,N
        j = aIndex(I)
        VALUES(j) = I
      END DO
      !$OMP END DO
      !$OMP END PARALLEL
      print *,'VALUES(1), VALUES(N) =',VALUES(1),VALUES(N)
end program U543478
end program U543478
C:\ISN_Forums\U543478>ifort -Qopenmp file.f90 -Qopenmp-report -Qopt-report-file=stdout
Intel(R) Visual Fortran Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 15.0.2.179 Build 20150121
Copyright (C) 1985-2015 Intel Corporation. All rights reserved.
ifort: command line remark #10010: option '/Qopenmp-report' is deprecated and will be removed in a future release. See '/help deprecated'
Begin optimization report for: U543478
Report from: OpenMP optimizations [openmp]
OpenMP Construct at C:\ISN_Forums\U543478\file.f90(11,7)
remark #16200: OpenMP DEFINED LOOP WAS PARALLELIZED
OpenMP Construct at C:\ISN_Forums\U543478\file.f90(10,7)
remark #16201: OpenMP DEFINED REGION WAS PARALLELIZED
===========================================================================
Microsoft (R) Incremental Linker Version 12.00.21005.1
Copyright (C) Microsoft Corporation. All rights reserved.
-out:file.exe
-subsystem:console
-defaultlib:libiomp5md.lib
-nodefaultlib:vcomp.lib
-nodefaultlib:vcompd.lib
file.obj
C:\ISN_Forums\U543478>file
VALUES(1), VALUES(N) = 1 10
C:\ISN_Forums\U543478>
Patrick
Patrick! That's awesome! You are right.
The loop was parallelized. The optimizer report was just wrong (or at least strange and confusing).
After adding /Qopenmp-report to the compiler switches, the report confirms:
remark #16200: OpenMP DEFINED LOOP WAS PARALLELIZED
Thank you very much
Benedikt.
An actual example, rather than a pseudocode snippet, would be helpful in resolving your problem.
>>>The loop was parallelized. The optimizer report was just wrong (or at least strange and confusing).
Being an old hack, I compiled with -Qopenmp-report out of habit. But it's deprecated now, and /Qopt-report-phase:openmp is the suggested replacement.
Unfortunately I can't seem to get 'OpenMP DEFINED LOOP WAS PARALLELIZED' using the suggested replacement in combination with the other -Qopt-report* compiler switches.
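For reference, the sort of invocation I've been trying, just combining the replacement switch with the -Qopt-report* switches already shown above:

C:\ISN_Forums\U543478>ifort -Qopenmp file.f90 -Qopt-report-phase:openmp -Qopt-report-file:stdout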
I'll look into this more closely and file a problem report if needed.
Patrick
It is worth considering the overhead of the !$OMP constructs themselves.
It takes between 5 and 20 microseconds to initiate a !$OMP PARALLEL region, which is equivalent to between 10,000 and 50,000 processor cycles. If the savings from multi-threaded processing do not exceed this cost, there is no gain. In the example above in quote #3, the original serial loop looks to be about 50 CPU cycles, so applying !$OMP will make the code run hundreds of times slower.
The usual recommendation for !$OMP is to parallelise the outer loop and vectorise the inner loop (a sketch of that pattern follows below). I wonder if ifort should provide recommendations on the likely gains (or lack thereof) where this can be assessed.
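A minimal sketch of that pattern, with hypothetical arrays A and B and sizes M and N (!$OMP SIMD requires OpenMP 4.0 or later; with older compilers just leave the inner loop to the auto-vectorizer):

      program outer_inner_sketch
      implicit none
      integer, parameter :: M = 1000, N = 1000   ! hypothetical problem sizes
      real :: A(N,M), B(N,M)
      integer :: i, j
      B = 1.0
!$OMP PARALLEL DO PRIVATE(i)    ! parallelise the outer loop across threads
      DO j = 1, M
!$OMP SIMD                      ! vectorise the inner loop within each thread
        DO i = 1, N
          A(i,j) = 2.0 * B(i,j) ! unit stride: Fortran arrays are column-major
        END DO
      END DO
!$OMP END PARALLEL DO
      print *, A(1,1), A(N,M)
      end program outer_inner_sketch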
John
So, my original question was answered, and John's comment is somewhat off-topic ...
but I'm certainly very interested!
Does that mean I should use one "!$OMP PARALLEL" and put as many loops inside it as possible?
Benedikt
>>>I wonder if ifort should provide recommendations on the likely gains (or lack thereof) where this can be assessed.
This wasn't a question about expected gains (or lack thereof) from OpenMP parallelization, but yes, a paltry 10 iterations is far, far below the threshold at which to expect any gains. But OpenMP assumes the programmer knows best and tries to do whatever is asked, regardless of expected gains, the existence of loop-carried dependencies, incorrect data-sharing attributes of variables, etc. It's a sharp knife, and you'll be badly cut if you mishandle it.
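For example, here is a minimal (hypothetical) sketch of a loop with a true loop-carried dependence; OpenMP will parallelize it exactly as asked, and the answer will be wrong for essentially any thread count greater than one:

      program sharp_knife
      implicit none
      integer, parameter :: N = 1000000
      real :: X(N)
      integer :: i
      X = 1.0
!$OMP PARALLEL DO
      DO i = 2, N
        X(i) = X(i) + X(i-1)   ! true dependence: iteration i reads iteration i-1
      END DO
!$OMP END PARALLEL DO
      ! serial answer is N; parallel runs typically print something smaller
      print *, 'X(N) =', X(N), ' (serial answer would be', real(N), ')'
      end program sharp_knife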
On the other hand, the auto-parallelizer and auto-vectorizer are somewhat more user-friendly, providing efficiency and dependency information, and as a performance-tuning BKM (best-known method) I've had good results testing the waters with -Qparallel before throwing OMP directives at the code.
Consider this non-OpenMP version of the original test case:
#ifdef BIGN
      integer, parameter :: N = 10000
#else
      integer, parameter :: N = 10
#endif
      integer, dimension(N) :: aIndex
      integer, dimension(N) :: VALUES
      integer i,j
      aIndex = (/(I,I=1,N)/)
      !DIR$ PARALLEL PRIVATE(j)
      DO I=1,N
        j = aIndex(I)
        VALUES(j) = I
      END DO
*** Case for N == 10 **************
C:\ISN_Forums\U543478>ifort -Qparallel -Qopt-report -Qopt-report-file:stdout -Qopt-report-phase:par file-no-omp.f90 -fpp
Intel(R) Visual Fortran Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 15.0.2.179 Build 20150121
Copyright (C) 1985-2015 Intel Corporation. All rights reserved.
Begin optimization report for: U543478
Report from: Auto-parallelization optimizations [par]
LOOP BEGIN at C:\ISN_Forums\U543478\file-no-omp.f90(14,1)
remark #17108: loop was not parallelized: insufficient computational work
LOOP END
LOOP BEGIN at C:\ISN_Forums\U543478\file-no-omp.f90(14,1)
<Remainder>
LOOP END
LOOP BEGIN at C:\ISN_Forums\U543478\file-no-omp.f90(17,8)
remark #17108: loop was not parallelized: insufficient computational work
LOOP END
*** Case for N == 10000 **************
C:\ISN_Forums\U543478>ifort -Qparallel -Qopt-report -Qopt-report-file:stdout -Qopt-report-phase:par file-no-omp.f90 -fpp -DBIGN
Intel(R) Visual Fortran Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 15.0.2.179 Build 20150121
Copyright (C) 1985-2015 Intel Corporation. All rights reserved.
Begin optimization report for: U543478
Report from: Auto-parallelization optimizations [par]
LOOP BEGIN at C:\ISN_Forums\U543478\file-no-omp.f90(14,1)
remark #17109: LOOP WAS AUTO-PARALLELIZED
LOOP END
LOOP BEGIN at C:\ISN_Forums\U543478\file-no-omp.f90(17,8)
remark #17109: LOOP WAS AUTO-PARALLELIZED
LOOP END
Patrick
>>> I can't seem to get 'OpenMP DEFINED LOOP WAS PARALLELIZED' using the suggested replacement
Reported to the developers, internal tracking ID DPD200368047. I'll keep this thread updated with any news.
Patrick
Patrick, Benedikt,
In Patrick's response #9 above, although the reports indicate that the auto-parallelizer declined to parallelize the simplified loop at N == 10 while OpenMP parallelized it regardless, this example does not indicate which choice was better. Unfortunately Patrick's example does not list the execution times for the first pass and for, say, some later pass after the cache is conditioned (~3rd pass).
Getting parallelization is not in and of itself the object of the exercise. Rather, the object should be getting as fast and/or as efficient as possible.
Here are some guidelines:
a) The very first !$OMP PARALLEL (anything) incurs a one-time, per-application overhead of creating the OpenMP thread pool. You cannot avoid this overhead unless you decide not to parallelize your program at all. You can reduce it by selecting a smaller number of threads than there are hardware threads on the system; however, this is a trade-off between the degree of parallelization and the overhead of creating the thread pool.
(the following assumes non-nested parallel regions)
b) Subsequent !$OMP PARALLEL (anything), when issued from the serial region of the application, does not re-instantiate the OpenMP thread pool, so that overhead is eliminated. However, there is an overhead of making a system call to wake up threads in the pool that have timed out of their spin-wait since the prior parallel region. The more threads of the pool that have been suspended, the higher the overhead. The spin-wait time is tunable via KMP_BLOCKTIME (environment variable) and/or kmp_set_blocktime() (runtime function); the default is typically ~200 ms.
c) !$OMP DO within an OpenMP parallel region generally has less overhead than !$OMP PARALLEL DO when there is more than one !$OMP DO within the region. Therefore, when you have two or more parallel loops that run one after another with little or no code between them, it is best to have one !$OMP PARALLEL region containing the multiple !$OMP DO loops (see the sketch after this list). However, if the code between the parallel loops performs I/O, you might consider not using a single parallel region. Also, be mindful that all threads execute the whole parallel region, which may require use of other !$OMP directives: MASTER, SINGLE, BARRIER, ...
d) If (when) the multiple loops within the parallel region are completely independent (not dependent on the results of earlier loops), then consider hand-partitioning the loops based on the number of threads, the thread number, and the loop count. An alternative is to configure some of the threads to execute each loop concurrently.
e) Once you get through d) and need more utilization, there are additional options available to you. Master a), b), c) and d) first.
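A minimal sketch of c), with hypothetical arrays A, B, C (the implied barrier at each !$OMP END DO keeps the two loops ordered, while the single PARALLEL region avoids paying the region-entry overhead twice):

      program one_region_two_loops
      implicit none
      integer, parameter :: N = 100000
      real :: A(N), B(N), C(N)
      integer :: i
      B = 1.0
!$OMP PARALLEL
!$OMP DO
      DO i = 1, N
        A(i) = 2.0 * B(i)
      END DO
!$OMP END DO        ! implied barrier: all of A is complete before the next loop
!$OMP DO
      DO i = 1, N
        C(i) = A(i) + B(i)
      END DO
!$OMP END DO
!$OMP END PARALLEL
      print *, 'C(1), C(N) =', C(1), C(N)
      end program one_region_two_loops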
Jim Dempsey
