>>When I use the !$OMP PARALLELDO PRIVATE (...) ... !$OMP END PARALLELDO, only one thread is created and it makes the whole loop.
Prior to !$OMP... are you already in a parallel region?
The behavior is as if you are, or have set the number of threads to 1.
Just prior to !$OMP PARALLELDO PRIVATE (...) insert
PRINT *, OMP_IN_PARALLEL(), OMP_GET_MAX_THREADS()
You should see .FALSE. and 20
Jim Dempsey
Can you copy and paste (using the {...} code button) the exact text of the !$OMP statement, inclusive of the Fortran DO statement?
Also, post your file name (i.e. is it xxx.f90, xxx.f, xxx.for, etc.); in other words, is the source compiled as fixed form or free form?
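(For context, a hedged aside on why the source form matters: the OpenMP sentinel rules differ between the two forms, so a directive that is valid in free form can be silently treated as an ordinary comment in fixed form.)
      !$omp parallel do private(i)   ! free form: the sentinel may be indented
!$omp parallel do private(i)         ! fixed form: the sentinel (!$omp, c$omp, or *$omp) must start in column 1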
Jim Dempsey
Add private integer variables: myFirst, myLast
...
myFirst = 0
!$omp do
do i=...                            ! use other index if not i
   if (myFirst == 0) myFirst = i    ! use other index if not i
   myLast = i
   ...                              ! remainder of loop
end do
!$omp end do
print *, omp_get_thread_num(), omp_in_parallel(), myFirst, myLast
!$omp end parallel
Jim Dempsey
>>10 T 1 1
What is the iteration count of your DO loop? Is it less than the number of threads?
It appears that you have an outer loop (not shown) that sets the iteration count of the inner (parallel) loop to 1, then 2, then 3, ...
Having an iteration count less than the thread count should be OK, though inefficient; it should still work.
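(A hedged illustration of that case, assuming USE OMP_LIB: with a 1-iteration loop and 20 threads, the default static schedule typically gives the single iteration to one thread and the other 19 no work, so the loop completes but cannot show any parallelism.)
   !$omp parallel do private(i)
   do i = 1, 1
      print *, 'iteration', i, 'ran on thread', omp_get_thread_num()
   end do
   !$omp end parallel do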
Can you make, and post, a simple reproducer?
It is odd that the release version works but the debug version does not. Usually it is the other way around.
ifort -module intOz/obj/Release/ -fpp -check bounds -free -assume norealloc_lhs -qopenmp -xCORE-AVX2 -O2 -traceback -I../cdi/installed-intel-ozonosfera-1.6.2/include -I../FLAP/compiled-intel-ozonosfera/static/mod -c -Tf clasif_kmedias.f90 -o intOz/obj/Release/clasif-kemedias/clasif_kmedias.o
ifort -module intOz/obj/Debug/ -fpp -check all -free -assume norealloc_lhs -qopenmp -traceback -warn all -debug full -check noarg_temp_created -heap-arrays -I../cdi/installed-intel-ozonosfera-1.6.2/include -I../FLAP/compiled-intel-ozonosfera/static/mod -c -Tf clasif_kmedias.f90 -o intOz/obj/Debug/clasif-kemedias/clasif_kmedias.o
In trying to isolate the issue (so that you can work around the problem), try changing the Debug build options, one at a time, to those of the Release build; you may find the offending option.
A simple reproducer can be posted here and submitted to Intel.
Jim Dempsey
Was this ever resolved? I have a very similar problem with a code that I used to have running in parallel, but something has changed and I cannot figure out why it doesn't work any more. I simplified it down to this:
print *, OMP_IN_PARALLEL(), OMP_GET_MAX_THREADS()
!$OMP PARALLEL
print *, 'hello from thread:', OMP_GET_THREAD_NUM()
!$OMP END PARALLEL
print *, OMP_IN_PARALLEL(), OMP_GET_MAX_THREADS()

req%Aii(1) = percentComplete
req%Aii(2) = 100. - percentComplete

localCount = 0
!$OMP PARALLEL DO default (shared) private (i, percentToDo)
do i = 1, 10
   localCount = localCount + 1
   percentToDo = (sin(req%Aii(1)) + cos(req%Aii(2))) + OMP_GET_THREAD_NUM()
   print *, i, percentToDo, OMP_GET_THREAD_NUM()
end do
!$OMP END PARALLEL DO
print *, localCount
print *, OMP_IN_PARALLEL(), OMP_GET_MAX_THREADS()

localCount = 0
!$OMP PARALLEL default (shared) private (i, percentToDo)
!$OMP DO
do i = 1, 10
   localCount = localCount + 1
   percentToDo = (sin(req%Aii(1)) + cos(req%Aii(2))) + OMP_GET_THREAD_NUM()
   print *, i, percentToDo, OMP_GET_THREAD_NUM()
end do
!$OMP END DO
!$OMP END PARALLEL
print *, localCount
print *, OMP_IN_PARALLEL(), OMP_GET_MAX_THREADS()
The first PARALLEL region works perfectly
F 8
hello from thread: 0
hello from thread: 6
hello from thread: 5
hello from thread: 3
hello from thread: 1
hello from thread: 2
hello from thread: 4
hello from thread: 7
F 8
The PARALLEL DO runs each iteration on thread 0.
1 0.862318873405457 0
2 0.862318873405457 0
3 0.862318873405457 0
4 0.862318873405457 0
5 0.862318873405457 0
6 0.862318873405457 0
7 0.862318873405457 0
8 0.862318873405457 0
9 0.862318873405457 0
10 0.862318873405457 0
10
The region where I put OMP PARALLEL followed by OMP DO executes the entire loop for every thread.
F 8
1 0.862318873405457 0
1 5.86231899261475 5
2 5.86231899261475 5
3 5.86231899261475 5
4 5.86231899261475 5
1 6.86231899261475 6
5 5.86231899261475 5
1 3.86231899261475 3
6 5.86231899261475 5
...
7 1.86231887340546 1
8 1.86231887340546 1
9 1.86231887340546 1
10 1.86231887340546 1
80
F 8
Any insight would be appreciated
The compiler will attempt to optimize the PARALLEL DO. In this specific case, the compiler determined that the work performed within the parallel region was insufficient to use more than one thread. For this specific loop, if you wish to force the iteration space to be partitioned one index at a time, add the clause SCHEDULE(static, 1):
!$OMP PARALLEL DO default (shared) private (i, percentToDo) schedule(static, 1)
do i = 1, 10
   localCount = localCount + 1
   percentToDo = (sin(req%Aii(1)) + cos(req%Aii(2))) + OMP_GET_THREAD_NUM()
   print *, i, percentToDo, OMP_GET_THREAD_NUM()
end do
!$OMP END PARALLEL DO
...
!$OMP PARALLEL default (shared) private (i, percentToDo)
!$OMP DO schedule(static, 1)
do i = 1, 10
   localCount = localCount + 1
   percentToDo = (sin(req%Aii(1)) + cos(req%Aii(2))) + OMP_GET_THREAD_NUM()
   print *, i, percentToDo, OMP_GET_THREAD_NUM()
end do
!$OMP END DO
!$OMP END PARALLEL
Jim Dempsey
Thanks, Jim.
As far as I can tell, schedule(static, 1) seems to have no effect on either loop.
Conceptually, I thought that
!$OMP PARALLEL DO default (shared) private (percentToDo)
do i = 1, 10
and
!$OMP PARALLEL default (shared) private (i, percentToDo)
!$OMP DO
do i = 1, 10
were supposed to be functionally equivalent, so it seems telling that the second version executes the entire loop for each thread (a total of 10 * 8 = 80 passes). It seems like it is ignoring the "!$OMP DO" completely.
This code snippet was simplified from code that does much more work in the PARALLEL DO. It was working at one point, but is no longer. The only substantial change I can think of is that I am using a newer version of the compiler (although not the latest). I am currently using 17.0.8.275 (64 bit).
program foo
   use omp_lib
   implicit none
   type req_t
      real :: Aii(10)
   end type req_t
   type(req_t) :: req
   integer :: i, localCount
   real :: percentToDo, percentComplete

   print *, OMP_IN_PARALLEL(), OMP_GET_MAX_THREADS()
   !$OMP PARALLEL
   print *, 'hello from thread:', OMP_GET_THREAD_NUM()
   !$OMP END PARALLEL
   print *, OMP_IN_PARALLEL(), OMP_GET_MAX_THREADS()

   percentComplete = 0.0
   req%Aii(1) = percentComplete
   req%Aii(2) = 100. - percentComplete

   localCount = 0
   !$OMP PARALLEL DO default (shared) private (i, percentToDo)
   do i = 1, 10
      localCount = localCount + 1
      percentToDo = (sin(req%Aii(1)) + cos(req%Aii(2))) + OMP_GET_THREAD_NUM()
      print *, i, percentToDo, OMP_GET_THREAD_NUM()
   end do
   !$OMP END PARALLEL DO
   print *, localCount
   print *, OMP_IN_PARALLEL(), OMP_GET_MAX_THREADS()

   localCount = 0
   !$OMP PARALLEL default (shared) private (i, percentToDo)
   !$OMP DO
   do i = 1, 10
      localCount = localCount + 1
      percentToDo = (sin(req%Aii(1)) + cos(req%Aii(2))) + OMP_GET_THREAD_NUM()
      print *, i, percentToDo, OMP_GET_THREAD_NUM()
   end do
   !$OMP END DO
   !$OMP END PARALLEL
   print *, localCount
   print *, OMP_IN_PARALLEL(), OMP_GET_MAX_THREADS()
end program foo

============= output ============
F 8
hello from thread: 1
hello from thread: 2
hello from thread: 7
hello from thread: 3
hello from thread: 5
hello from thread: 4
hello from thread: 6
hello from thread: 0
F 8
6 3.862319 3
1 0.8623189 0
10 7.862319 7
5 2.862319 2
3 1.862319 1
7 4.862319 4
8 5.862319 5
4 1.862319 1
9 6.862319 6
2 0.8623189 0
10
F 8
1 0.8623189 0
8 5.862319 5
9 6.862319 6
7 4.862319 4
5 2.862319 2
2 0.8623189 0
3 1.862319 1
4 1.862319 1
6 3.862319 3
10 7.862319 7
9
F 8
And:
program foo
   use omp_lib
   implicit none
   type req_t
      real :: Aii(10)
   end type req_t
   type(req_t) :: req
   integer :: i, localCount
   real :: percentToDo, percentComplete

   print *, OMP_IN_PARALLEL(), OMP_GET_MAX_THREADS()
   !$OMP PARALLEL
   print *, 'hello from thread:', OMP_GET_THREAD_NUM()
   !$OMP END PARALLEL
   print *, OMP_IN_PARALLEL(), OMP_GET_MAX_THREADS()

   percentComplete = 0.0
   req%Aii(1) = percentComplete
   req%Aii(2) = 100. - percentComplete

   localCount = 0
   !$OMP PARALLEL DO default (shared) private (i, percentToDo) schedule(static, 1)
   do i = 1, 10
      localCount = localCount + 1
      percentToDo = (sin(req%Aii(1)) + cos(req%Aii(2))) + OMP_GET_THREAD_NUM()
      print *, i, percentToDo, OMP_GET_THREAD_NUM()
   end do
   !$OMP END PARALLEL DO
   print *, localCount
   print *, OMP_IN_PARALLEL(), OMP_GET_MAX_THREADS()

   localCount = 0
   !$OMP PARALLEL default (shared) private (i, percentToDo)
   !$OMP DO schedule(static, 1)
   do i = 1, 10
      localCount = localCount + 1
      percentToDo = (sin(req%Aii(1)) + cos(req%Aii(2))) + OMP_GET_THREAD_NUM()
      print *, i, percentToDo, OMP_GET_THREAD_NUM()
   end do
   !$OMP END DO
   !$OMP END PARALLEL
   print *, localCount
   print *, OMP_IN_PARALLEL(), OMP_GET_MAX_THREADS()
end program foo

====================== output ================
F 8
hello from thread: 1
hello from thread: 0
hello from thread: 6
hello from thread: 2
hello from thread: 7
hello from thread: 3
hello from thread: 4
hello from thread: 5
F 8
7 6.862319 6
2 1.862319 1
6 5.862319 5
3 2.862319 2
4 3.862319 3
10 1.862319 1
1 0.8623189 0
8 7.862319 7
9 0.8623189 0
5 4.862319 4
10
F 8
1 0.8623189 0
2 1.862319 1
9 0.8623189 0
4 3.862319 3
7 6.862319 6
5 4.862319 4
6 5.862319 5
10 1.862319 1
8 7.862319 7
3 2.862319 2
10
F 8
On my system the schedule(static, 1) was not required.
Intel® Parallel Studio XE 2020 Composer Edition for Fortran Windows* Package ID: w_comp_lib_2020.0.166
Intel® Parallel Studio XE 2020 Composer Edition for Fortran Windows* Integration for Microsoft Visual Studio* 2019,
Version 19.1.0055.16, Copyright © 2002-2019 Intel Corporation. All rights reserved.
Not sure what is going on on your end.
Copy and paste the sample code I have above (sans the output). See what happens.
Jim Dempsey
It doesn't work for me. It doesn't build the solution:
Error fatal error LNK1120: 3 unresolved externals x64\Debug\zoo.exe
Error error LNK2019: unresolved external symbol omp_get_max_threads referenced in function MAIN__ zoo.obj
Error error LNK2019: unresolved external symbol omp_in_parallel referenced in function MAIN__ zoo.obj
Error error LNK2019: unresolved external symbol omp_get_thread_num referenced in function MAIN__ zoo.obj
Should I write anything in the command line?
I have installed Visual Studio 2019 Community and Intel oneapi 2021 base and HPC toolkit.
You are replying to an old conversation; it is better to start a new one. But from the error messages you post, it looks as if you have not turned on the OpenMP option in the properties of your program's project.
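(For reference, a hedged example of the command-line equivalent of that project setting with the Intel Fortran compiler on Windows; the file name is just the one taken from the error messages above:)
   ifort /Qopenmp zoo.f90
In Visual Studio, the corresponding setting is under the project's Fortran > Language properties (enable processing of OpenMP directives).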
It is really bizarre. I did the same as you... I created a standalone console App and pasted my code in. It works perfectly, just as you demonstrated.
My production code is actually a DLL that is called by another program (and it is also compiled with several static libraries). In my little standalone test, I also made a static library, compiled it into a DLL, and called it from a different code. It still works correctly.
I went line-by-line through the settings for the project of the test code and the production code and they are pretty much the same. I made the settings in the test code match those in the production code and the test code still works.
I also did the following; although it is ugly, it seems to work. Instead of using OMP DO, I just made the region parallel and did this:
if (OMP_GET_THREAD_NUM() == omp_get_max_threads() - 1) then
   iEnd = localCount
else
   iEnd = (OMP_GET_THREAD_NUM() + 1) * (localCount / omp_get_max_threads())
end if
iStart = (OMP_GET_THREAD_NUM()) * (localCount / omp_get_max_threads()) + 1

! main loop
do i = iStart, iEnd
The do loop used to be "do i = 1, localCount", so I'm just splitting the loop up manually for each thread. Perhaps there is a way to do this more efficiently.
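(A hedged sketch of one way to even out that manual split, reusing the same iStart/iEnd/localCount names; the remainder iterations are spread over the first few threads so chunk sizes differ by at most one. nThreads, id, chunk, and extra are new private integers introduced here purely for illustration.)
   nThreads = omp_get_num_threads()
   id       = omp_get_thread_num()
   chunk    = localCount / nThreads
   extra    = mod(localCount, nThreads)
   iStart   = id * chunk + min(id, extra) + 1
   iEnd     = iStart + chunk - 1
   if (id < extra) iEnd = iEnd + 1
   ! main loop
   do i = iStart, iEnd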
>>My production code is actually a DLL that is called by another program
If the calling code is also using OpenMP .AND. is calling your DLL from within a parallel region .AND. nested parallelism is disabled (the default), then any attempt at nesting parallel regions will result in 1 thread at the nested level(s).
omp_get_max_threads() does not necessarily return the number of threads for a parallel region; omp_get_num_threads() will.
If !$OMP PARALLEL produces a parallel region (with multiple threads) and !$OMP PARALLEL DO does not, see what the following reports:
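(A hedged sketch of a quick check for that nesting situation, placed inside the DLL routine just before its parallel region; the prints only report state and change nothing else:)
   use omp_lib
   ...
   print *, 'in_parallel       =', omp_in_parallel()
   print *, 'level / active    =', omp_get_level(), omp_get_active_level()
   print *, 'max active levels =', omp_get_max_active_levels()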
allocate(threadRan(0:omp_get_max_threads()-1))   ! LOGICAL
threadRan = .false.
!$OMP PARALLEL
nThreadsParallel = omp_get_num_threads()          ! integer
!$OMP DO ...
do I = 1, 10
   threadRan(omp_get_thread_num()) = .true.
   ...
end do
!$omp end do
!$omp end parallel
print *, nThreadsParallel
print *, threadRan
...
Jim Dempsey
In Quote #11, you could try the following changes, since localCount is shared (or use !$OMP ATOMIC?):
!$OMP PARALLEL DO default (shared) private (i, id, percentToDo) reduction (+ : localCount)
do i = 1, 10
   id = OMP_GET_THREAD_NUM()
   localCount = localCount + 1
   percentToDo = (sin(req%Aii(1)) + cos(req%Aii(2))) + id
   print *, i, percentToDo, id, OMP_GET_MAX_THREADS(), OMP_IN_PARALLEL()
end do
!$OMP END PARALLEL DO
print *, localCount
print *, i, percentToDo, id, OMP_GET_MAX_THREADS(), OMP_IN_PARALLEL(), ' is i=11 or 0 ?'
As an update, I was not able to figure out why my production code will not execute the PARALLEL DO. In answer to post #16, I do not have any nested parallel regions. I only have a single parallel region.
I have a sort of workaround where I just create a parallel region with !$OMP PARALLEL, and then in each thread I adjust the upper and lower bounds of the do loop based on the thread ID. So basically I am breaking the items into evenly sized chunks and executing each chunk on a different thread.
However, this does not give me nearly the performance that I have seen from this code in the past. I used to get performance gains up to about 8 processors (on an 8-core machine), but now I barely see any improvement past 2-4 threads. I suspect my manual split-up is just not balancing the load well; I probably have some threads finishing much earlier than others.
Here is where it gets more interesting. I have been using Fortran 2017 Update 8. I first asked a colleague to compile this with Version 2019 (~update 2 I think). It did run the code in parallel, but it was much slower.
I downloaded and installed Fortran 2020 (Update 1). The good news is that this code does multi-thread my PARALLEL DO loop. The bad news is that the code is MUCH slower than the 2017-compiled code. Using 8 cores, the code is about 6 times slower than the code compiled with Fortran 2017 (which is running single-threaded because of the aforementioned bug). Single-threaded with the new compiler, it is 11 times slower.
2017 compiled 1 thread wall clock time = 5.5 seconds
2019 compiled 1 thread wall clock time = 59 seconds
2019 compiled 2 thread wall clock time = 41 seconds
2019 compiled 4 thread wall clock time = 34 seconds
2019 compiled 8 threads wall clock time = 32 seconds
In past versions of this code, this same problem should be running in ~1 second.
I'm going to try to open a support ticket with Intel. Our company is moving to Visual Studio 2019, so it's going to be tough for me to stick with Fortran 2017, but we can't live with this kind of slowdown. This code was originally developed, I believe, with Intel 2010, and since then we have gone through 2013, 2015, and 2017 without issues. I wonder what could have changed to make my code so much slower.
I looked through all of the project settings and see nowhere that I could set any kind of compatibility option.