I am experimenting with DO CONCURRENT under IFX 2023.0.0 20221201 for CPU-level threading. I have noticed behavior that seems less intuitive than what I see under IFORT, and inconsistent with OpenMP behavior.
In short, it seems that under IFX the number of threads used by a DO CONCURRENT construct equals the value of the environment variable OMP_NUM_THREADS, if set, and cannot be overridden by the omp_set_num_threads() API. Instrumenting the same loop with !$omp parallel do yields the expected result (i.e. the value set by omp_set_num_threads() takes priority, even for IFX). Also, the thread count for DO CONCURRENT under IFORT is consistent with what is set by omp_set_num_threads(); only IFX differs.
I can work around the issue by explicitly removing the environment variable OMP_NUM_THREADS for IFX; then the value set by the omp_set_num_threads() API is indeed used for the DO CONCURRENT construct.
Could be a bug. I'll test it out and get an answer. I too would expect omp_set_num_threads() to override OMP_NUM_THREADS for the do concurrent.
Hi Ron, thank you for looking into this issue. Were you able to reproduce the problem on your end?
The latest IFX version I have available is 23.0.0. Maybe 23.2.0 has this resolved? Thanks.
This is much more complex than I thought. In our Front End, we mark the loop as a parallel loop, along with other information about the data used inside the loop, and pass that to our optimizer and parallelization passes. This is where things get interesting. If the DO CONCURRENT is inside an outer parallel do region, the parallel optimization phase has some choices. Like you, I thought it would just thread it with OMP PARALLEL DO. But another choice, if it is inside an outer loop, is to see whether the loops can be collapsed. Then the preference is to vectorize the DO CONCURRENT and not thread it.
This matches the general strategy of "parallelize the outermost loop, vectorize the innermost loop".
So OMP threading may not be done at all! It may just reduce the DO CONCURRENT to a normal vectorized loop.
What evidence do you have that in your case it is running the do concurrent as a threaded omp loop?
Tomorrow I hope to test a case like this under VTune, which should show the threading behavior.
I'll just throw in that the Fortran language does not require DO CONCURRENT to be run in parallel. Rather, it establishes conditions that permit parallelization. Vectorizing is a form of parallelization.
Hi Steve, I do realize that the Fortran language standard does not require DO CONCURRENT to be run in parallel. I arrived at my conclusion that OMP_NUM_THREADS takes priority over the value set through the omp_set_num_threads() API as follows:
I have a test program that contains a double do loop over 200 x 100 elements. In this particular case I changed the outer (200 iterations) do loop to "do concurrent". The serial loop alone takes about 8s to execute, long enough to watch it with top using 1s updates.
I have been using this test program for a while, and have found it convenient to change the number of threads available for OpenMP or DO CONCURRENT loops via the omp_set_num_threads() API from within the program. Watching with top, I noticed that for IFX it still ran single threaded, even when calling omp_set_num_threads() with 2, 4, or 8 threads. I was baffled by this, because my experience with IFORT and other Fortran compilers had been that I could change the number of threads this way. I finally noticed that, by default, in the interactive queue where I was running this, the environment variable OMP_NUM_THREADS was set to 1. I had never bothered unsetting it before, because I was used to the omp_set_num_threads() setting overriding whatever came from the OMP_NUM_THREADS environment variable. Well, I unset the OMP_NUM_THREADS variable, and voila, I started seeing different numbers of threads running in the DO CONCURRENT loop (using top), according to what I set with omp_set_num_threads(). Performance also scaled almost perfectly with the number of threads, as expected.
So while I agree that the Fortran standard does not require DO CONCURRENT to run in parallel, it seems that the compiler does actually generate threaded code for it here; it is just that OMP_NUM_THREADS=1 in the environment kept it single threaded, regardless of what I set with omp_set_num_threads() from within the program itself.
So far I have only tested with 23.0.0, but I now have access to 23.2.1. I will re-test with that version soon to see if anything has changed.
>> OMP_NUM_THREADS was set to 1
You did not mention that in your original post. Setting the OMP_NUM_THREADS environment variable sets the OMP_MAX_THREADS value and thus places an upper limit on the omp_set_num_threads(nn) value. Note that the upper value for omp_set_num_threads(nn) may also depend on whether it is executed within a parallel region and on whether nested parallelism is enabled.
Jim Dempsey
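One way to check what the runtime actually does with these settings is to print the relevant values before and after the API call. The following is only a minimal sketch, not taken from the thread: the program name and the `requested` value are hypothetical. It relies on the standard `omp_lib` routines `omp_get_max_threads()`, `omp_get_thread_limit()`, and `omp_get_num_threads()`. As far as the OpenMP specification goes, OMP_NUM_THREADS only provides the initial default team size, while a hard cap is expressed separately through the thread limit (OMP_THREAD_LIMIT), so the output should show which of the two is in effect.

```fortran
program show_thread_settings
  use omp_lib
  implicit none
  integer, parameter :: requested = 4   ! hypothetical value, for illustration only
  integer :: team_size

  ! Default team size for the next parallel region (initialized from
  ! OMP_NUM_THREADS when that variable is set).
  print *, "omp_get_max_threads (initial):   ", omp_get_max_threads()
  ! Upper bound on the number of threads the program may use.
  print *, "omp_get_thread_limit:            ", omp_get_thread_limit()

  call omp_set_num_threads(requested)
  print *, "omp_get_max_threads (after set): ", omp_get_max_threads()

  ! Measure the team size that a parallel region really gets.
  team_size = 0
  !$omp parallel
  !$omp single
  team_size = omp_get_num_threads()
  !$omp end single
  !$omp end parallel
  print *, "team size in parallel region:    ", team_size
end program show_thread_settings
```

Running this once with OMP_NUM_THREADS=1 and once with the variable unset (compiled with OpenMP enabled) should make it clear whether the environment variable acts as a default or as a limit for explicit OpenMP regions.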
However, even with OMP_NUM_THREADS=1 in the environment:
- IFORT accepts omp_set_num_threads(nn) with nn>1 fine for OpenMP loops.
- IFORT accepts omp_set_num_threads(nn) with nn>1 fine for DO-CONCURRENT loops.
- IFX accepts omp_set_num_threads(nn) with nn>1 fine for OpenMP loops.
Only IFX with DO-CONCURRENT loops behaves differently. Again, this is with 2023.0.0, and I am planning to test with 2023.2.1 as soon as I can, but I will have to wait until early next week for it.
But are you saying that the three cases listed above are actually violating the OMP_MAX_THREADS value set implicitly when OMP_NUM_THREADS is found in the environment?
@Theurich Testing with 23.2.1 is a good test. We froze code for 23.0 around early October 2022. Since then, and up to code freeze for 2023.2.x, we've put in 571 fixes. In particular, DO CONCURRENT had many changes for functionality, bugs, AND performance leading up to 2023.2.0. This includes the F2023 REDUCTION clause, which could be important for you in the future. Also, the locality-spec features, DEFAULT, LOCAL, LOCAL_INIT, SHARED, received a lot of attention in the first half of 2023, after 2023.0 was released. In short, a lot of work went into DO CONCURRENT in 2023. Now, will some of this affect your code? Hard to say. But I can say that if I had code with DO CONCURRENT, I would upgrade immediately.
Let us know what you find. And if you find an issue, a code example could help us fix anything sub-par.
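For readers who have not used these features, here is a small sketch of what locality specifiers and the F2023 reduction look like on a DO CONCURRENT. It is an illustration under assumptions, not code from this thread: the program and variable names are made up, and it needs a compiler that accepts the F2023 `REDUCE` locality specifier (the spelling used in the standard for the reduction feature).

```fortran
program dc_locality_sketch
  implicit none
  integer, parameter :: n = 1000
  integer :: i
  real :: x(n), tmp, total

  do i = 1, n
    x(i) = real(i)
  end do

  total = 0.0
  ! LOCAL(tmp): every iteration works on its own copy of tmp.
  ! REDUCE(+:total): per-iteration contributions are combined into total.
  do concurrent (i = 1:n) local(tmp) reduce(+:total)
    tmp = 2.0 * x(i)
    total = total + tmp
  end do

  ! 2 * (1 + 2 + ... + n) = n*(n+1); every partial sum is a small integer,
  ! so the single-precision result is exact regardless of summation order.
  if (total /= real(n) * real(n + 1)) error stop "unexpected reduction result"
  print *, "total = ", total
end program dc_locality_sketch
```

Without the locality specifiers, `tmp` and `total` would be shared across iterations, which the DO CONCURRENT rules do not allow for variables that are both defined and referenced by different iterations.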
@Ron_Green I finally have access to 2023.2.1 (specifically: ifx (IFX) 2023.2.0 20230721), and I tested again with DO CONCURRENT. Still the same behavior: as long as `OMP_NUM_THREADS` is set in the environment, any change made from within the program via `omp_set_num_threads(nn)` is ignored. However, as soon as I unset `OMP_NUM_THREADS`, all works as expected.
It is not a big deal to unset `OMP_NUM_THREADS`, but the behavior differs from OpenMP, where the API call `omp_set_num_threads(nn)` from within the program takes priority over what comes from the environment.
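Since the behavior depends on whether `OMP_NUM_THREADS` is present in the environment at all, a program can at least detect that situation at startup. The sketch below is a hypothetical helper, not part of my test code: it uses the standard intrinsic `get_environment_variable` to report whether the variable is set before calling `omp_set_num_threads()`. The `requested` value and the program name are placeholders.

```fortran
program check_omp_num_threads_env
  use omp_lib
  implicit none
  integer, parameter :: requested = 4   ! hypothetical value, for illustration only
  character(len=32) :: val
  integer :: length, stat

  call get_environment_variable(name="OMP_NUM_THREADS", value=val, &
                                length=length, status=stat)

  ! status: 0 = set, -1 = set but longer than val, 1 = not set
  if (stat == 0 .or. stat == -1) then
    print *, "OMP_NUM_THREADS is set to '", val(1:min(length, len(val))), "'"
    print *, "DO CONCURRENT may not follow omp_set_num_threads() in this case"
  else
    print *, "OMP_NUM_THREADS is not set"
  end if

  call omp_set_num_threads(requested)
  print *, "omp_get_max_threads after omp_set_num_threads: ", omp_get_max_threads()
end program check_omp_num_threads_env
```

This does not change the threading of DO CONCURRENT; it only makes the condition visible, so a job script can be adjusted to unset the variable.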
It would be nice to get confirmation that the current behavior with IFX + OMP_NUM_THREADS + DO-CONCURRENT is not the intended behavior, and that we can expect future versions of IFX to move toward the more consistent behavior discussed here.
That is, calling omp_set_num_threads(nn) from inside the code should take priority over the environment variable OMP_NUM_THREADS, as it does in all of the other cases outlined above:
- IFORT accepts omp_set_num_threads(nn) with nn>1 fine for OpenMP loops.
- IFORT accepts omp_set_num_threads(nn) with nn>1 fine for DO-CONCURRENT loops.
- IFX accepts omp_set_num_threads(nn) with nn>1 fine for OpenMP loops.
Thanks,
-Gerhard
@Barbara_P_Intel , your comment reminded me of Tom Hanks and Wilson. The package finally arrived.
Thanks for the smile, it is better than Fortraning.
Is Fortraning a real word?
Should it be Fortranning?
or FORTRANing?
Consider the following code:
program demoOmpNumSet
  use omp_lib
  implicit none
  ! Thread count requested through the API; varied between test runs
  integer, parameter :: omp_num_threads=16
  integer, parameter :: size=10000, repeater=100
  integer :: rep, i, j
  real, allocatable :: a(:,:), b(:,:)
  double precision :: t0, ti, t1, t2

  allocate(a(size,size), b(size,size))

  ! Note: outside a parallel region omp_get_num_threads() always returns 1;
  ! omp_get_max_threads() reports the team size the next region would get.
  print *, "omp_get_num_threads: ", omp_get_num_threads()
  print *, "omp_get_max_threads: ", omp_get_max_threads()
  print *
  print *, "omp_set_num_threads: ", omp_num_threads
  call omp_set_num_threads(omp_num_threads)
  print *
  print *, "omp_get_num_threads: ", omp_get_num_threads()
  print *, "omp_get_max_threads: ", omp_get_max_threads()

  t0 = omp_get_wtime()
  call random_number(b)
  ti = omp_get_wtime()

  ! Version 1: outer loop threaded explicitly with OpenMP
  do rep=1, repeater
    !$omp parallel do
    do j=1, size
      do i=1, size
        a(i,j) = b(i,j) * b(i,j)
      enddo
    enddo
    !$omp end parallel do
  enddo
  t1 = omp_get_wtime()

  ! Version 2: same work, outer loop expressed as DO CONCURRENT
  do rep=1, repeater
    do concurrent (j=1:size)
      do i=1, size
        a(i,j) = b(i,j) * b(i,j)
      enddo
    enddo
  enddo
  t2 = omp_get_wtime()

  print *, "Time to initialize: ", ti-t0
  print *, "Time OpenMP loop: ", t1-ti
  print *, "Time DO-CONCURRNET: ", t2-t1
end program
I vary the omp_num_threads parameter between tests, and I also adjust the OMP_NUM_THREADS environment variable or unset it completely. Then I observe how long the loops take to execute, and I watch with top how many threads are active on the system while the code runs.
With IFORT (2021.8.0 20221119) both OMP and DO-CONCURRENT behave as expected, i.e. the value set by omp_set_num_threads() within the code determines the number of threads used by either approach, regardless of whether or how the OMP_NUM_THREADS environment variable is set:
OMP_NUM_THREADS=1, omp_num_threads=1:
omp_get_num_threads: 1
omp_get_max_threads: 1
omp_set_num_threads: 1
omp_get_num_threads: 1
omp_get_max_threads: 1
Time to initialize: 0.276350975036621
Time OpenMP loop: 3.36269402503967
Time DO-CONCURRNET: 3.35191202163696
OMP_NUM_THREADS=1, omp_num_threads=2:
omp_get_num_threads: 1
omp_get_max_threads: 1
omp_set_num_threads: 2
omp_get_num_threads: 1
omp_get_max_threads: 2
Time to initialize: 0.276911020278931
Time OpenMP loop: 2.78357601165771
Time DO-CONCURRNET: 2.87240791320801
OMP_NUM_THREADS=1, omp_num_threads=4:
omp_get_num_threads: 1
omp_get_max_threads: 1
omp_set_num_threads: 4
omp_get_num_threads: 1
omp_get_max_threads: 4
Time to initialize: 0.276546955108643
Time OpenMP loop: 2.19221901893616
Time DO-CONCURRNET: 2.16165018081665
unset OMP_NUM_THREADS, omp_num_threads=4:
omp_get_num_threads: 1
omp_get_max_threads: 256
omp_set_num_threads: 4
omp_get_num_threads: 1
omp_get_max_threads: 4
Time to initialize: 0.313963890075684
Time OpenMP loop: 2.22668409347534
Time DO-CONCURRNET: 2.30029201507568
However, with IFX (2023.0.0 20221201) things change! OMP still works as before, but the number of threads used by DO-CONCURRENT seems to be fixed by the environment. If the OMP_NUM_THREADS environment variable is set, the value is taken from there; if it is unset, it defaults to 256 on the compute nodes I am working on.
Notice also that the initialization time goes up, but I am not concerned about that here.
OMP_NUM_THREADS=1, omp_num_threads=1:
omp_get_num_threads: 1
omp_get_max_threads: 1
omp_set_num_threads: 1
omp_get_num_threads: 1
omp_get_max_threads: 1
Time to initialize: 1.28865694999695
Time OpenMP loop: 3.31528496742249
Time DO-CONCURRNET: 3.30231809616089
OMP_NUM_THREADS=1, omp_num_threads=2:
omp_get_num_threads: 1
omp_get_max_threads: 1
omp_set_num_threads: 2
omp_get_num_threads: 1
omp_get_max_threads: 2
Time to initialize: 1.28990292549133
Time OpenMP loop: 2.97741007804871
Time DO-CONCURRNET: 3.40985584259033
OMP_NUM_THREADS=1, omp_num_threads=4:
omp_get_num_threads: 1
omp_get_max_threads: 1
omp_set_num_threads: 4
omp_get_num_threads: 1
omp_get_max_threads: 4
Time to initialize: 1.28875398635864
Time OpenMP loop: 1.80025887489319
Time DO-CONCURRNET: 3.32655405998230
unset OMP_NUM_THREADS, omp_num_threads=4:
omp_get_num_threads: 1
omp_get_max_threads: 256
omp_set_num_threads: 4
omp_get_num_threads: 1
omp_get_max_threads: 4
Time to initialize: 1.28911781311035
Time OpenMP loop: 2.22033119201660
Time DO-CONCURRNET: 1.81347393989563
Obviously this is a very artificial example, and I am not really concerned about specific performance or anything here. In fact my main concern is to understand how we can set the number of threads available to DO-CONCURRENT from inside the executable. With IFORT this worked fine based on omp_set_num_threads(), but with IFX, this mechanism seems to no longer work. Thanks.
Thank you for sharing your reproducer. I started looking at a matmul example and see a similar performance difference with ifx between OMP and DO CONCURRENT.
The Fortran developers like to see multiple reproducers to test their fixes.
I filed a bug report for this issue, CMPLRLLVM-52450. Will keep you posted on its progress to a fix.
Awesome, and thanks for letting me know. I am looking forward to seeing how this progresses. Thanks!
-Gerhard
I learned something new about DO CONCURRENT and -qopenmp. For CPU the DO CONCURRENT is translated to OMP SIMD directives. So omp_set_num_threads() has no impact!
The parallel optimization team is working to implement DO CONCURRENT with OMP PARALLEL DO in a future release.
I'll keep you posted on the progress.
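Until that lands, code that must control its thread count from inside the program could spell out the threading explicitly. The sketch below is only one possible workaround, offered as an assumption rather than a recommendation from the compiler team: the outer loop uses `!$omp parallel do`, which honors `omp_set_num_threads()` in the tests reported in this thread, and the inner loop uses `!$omp simd` to request vectorization. The array size and the `nthreads` value are placeholders.

```fortran
program explicit_threads_and_simd
  use omp_lib
  implicit none
  integer, parameter :: n = 512
  integer, parameter :: nthreads = 4   ! hypothetical value, for illustration only
  integer :: i, j
  real, allocatable :: a(:,:), b(:,:)

  allocate(a(n,n), b(n,n))
  call random_number(b)

  ! Request the team size from inside the program.
  call omp_set_num_threads(nthreads)

  ! Outer loop: threaded explicitly. Inner loop: vectorization requested.
  !$omp parallel do private(i)
  do j = 1, n
    !$omp simd
    do i = 1, n
      a(i,j) = b(i,j) * b(i,j)
    end do
  end do
  !$omp end parallel do

  ! Sanity check against the array expression.
  if (any(a /= b * b)) error stop "unexpected result"
  print *, "max threads for the parallel loop: ", omp_get_max_threads()
end program explicit_threads_and_simd
```

The trade-off is that the loop is no longer expressed in standard Fortran alone, which is part of the appeal of DO CONCURRENT in the first place.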