Intel® Fortran Compiler

DO CONCURRENT with IFX 2023.0.0 20221201 uses OMP_NUM_THREADS over omp_set_num_threads()

Theurich
Novice
2,932 Views

I am experimenting with DO CONCURRENT under IFX 2023.0.0 20221201 for CPU level threading. I have noticed behavior that seems less intuitive than what I find under IFORT, and inconsistent with OpenMP behavior.

In short, it seems that under IFX the number of threads used by a DO CONCURRENT construct equals the value of the environment variable OMP_NUM_THREADS, if set, and cannot be overridden through the omp_set_num_threads() API. Instrumenting the same loop with !$omp parallel do yields the expected result (i.e. the value set by omp_set_num_threads() takes priority, even for IFX). Under IFORT, the thread count for DO CONCURRENT is also consistent with the value set by omp_set_num_threads(); under IFX it is not.

I can work around the issue by explicitly removing the environment variable OMP_NUM_THREADS for IFX; the value set through the omp_set_num_threads() API is then indeed used for the DO CONCURRENT construct.
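Since the workaround depends on whether OMP_NUM_THREADS is present in the environment, a program can check for it at startup. The following is a minimal sketch using the standard GET_ENVIRONMENT_VARIABLE intrinsic; the program and variable names are made up for illustration, and the printed note only restates the behavior reported in this thread.

```fortran
program check_omp_env
  implicit none
  character(len=32) :: val
  integer :: length, status

  ! Standard Fortran 2003 intrinsic; status is 1 when the variable does not exist.
  call get_environment_variable("OMP_NUM_THREADS", value=val, length=length, status=status)

  if (status == 1) then
    print *, "OMP_NUM_THREADS is not set"
  else if (status == 0 .or. status == -1) then   ! -1 means the value was truncated
    print *, "OMP_NUM_THREADS is set to: ", trim(val)
    print *, "note: with the ifx versions discussed here, DO CONCURRENT may follow this value"
  else
    print *, "could not query the environment, status = ", status
  end if
end program check_omp_env
```

This only detects the situation. Unsetting the variable still has to happen in the shell or job script before the program starts.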

17 Replies
Ron_Green
Moderator
2,916 Views

Could be a bug.  I'll test it out and get an answer.  I too would expect omp_set_num_threads() to override OMP_NUM_THREADS for the do concurrent.

Theurich
Novice
2,871 Views

Hi Ron, thank you for looking into this issue. Were you able to reproduce the problem on your end?
The latest IFX version I have available is 23.0.0. Maybe 23.2.0 has this resolved? Thanks.

Ron_Green
Moderator
2,826 Views

This is much more complex than I thought.  For our Front End, we mark the loop as a parallel loop, along with other information related to the data used inside the loop, and pass that to our optimizer and parallelization passes.  This is where things get interesting.  If the DO CONCURRENT is inside an outer parallel do region, the parallel optimization phase has some choices.  Like you, I thought it would just OMP thread with PARALLEL DO, but another choice, if this is inside an outer loop, is to see whether the loops can be collapsed.  Then the preference is to vectorize the DO CONCURRENT and not thread it.

This matches the general strategy of "parallelize the outermost loop, vectorize the innermost loop".  

So OMP threading may not be done at all!  It may just reduce the DO CONCURRENT to a normal vectorized loop.

 

What evidence do you have that in your case it is running the do concurrent as a threaded omp loop?

 

Tomorrow I hope to test a case like this under VTune, which should show the threading behavior.

Steve_Lionel
Honored Contributor III
2,818 Views

I'll just throw in that the Fortran language does not require DO CONCURRENT to be run in parallel. Rather, it establishes conditions that permit parallelization. Vectorizing is a form of parallelization.
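As an illustration of that point, here is a minimal sketch (not from the thread; the program name, array names, and size are made up) of a loop whose iterations are independent, which is the property DO CONCURRENT asserts. The result must be the same whether the compiler runs the loop serially, vectorized, or threaded.

```fortran
program dc_independent
  implicit none
  integer, parameter :: n = 1000   ! illustrative size, not from the thread
  integer :: i
  real :: x(n), y(n)

  ! Fill the input with known values.
  do i = 1, n
    x(i) = real(i)
  end do

  ! Each iteration reads and writes only element i,
  ! so the iterations may execute in any order.
  do concurrent (i = 1:n)
    y(i) = 2.0 * x(i) + 1.0
  end do

  ! Self-check: the values are small integers, so the comparison is exact.
  if (any(y /= 2.0 * x + 1.0)) error stop "unexpected result"
  print *, "y(1), y(n) = ", y(1), y(n)
end program dc_independent
```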

Theurich
Novice
2,810 Views

Hi Steve, I do realize that the Fortran language standard does not require DO CONCURRENT to be run in parallel. The way I was arriving at my conclusions about OMP_NUM_THREADS taking priority over the value set through omp_set_num_threads() API, was as follows:

 

I have a test program that I use for testing that contains a double do loop over 200 x 100 elements. In the particular case I changed the outer (200 iterations) do loop to "do concurrent". The serial loop alone takes about 8s to execute. Long enough to watch it with top using 1s updates. 

 

I was using this test program for a while, and have found it convenient to change the number of threads available for OpenMP or DO CONCURRENT loops via the omp_set_num_threads() API from within the program. Watching with top I noticed that for IFX it still ran single threaded, even when calling omp_set_num_threads() with 2, 4, or 8 threads. I was baffled by this, because my experience with IFORT and other Fortran compilers had been that I could change the number of threads this way. I finally noticed that by default, in the interactive queue where I was running this, the environment variable OMP_NUM_THREADS was set to 1. I just never bothered unsetting it before, because I was used to the omp_set_num_threads() setting overriding whatever came from the OMP_NUM_THREADS environment variable. Well, I unset the OMP_NUM_THREADS variable, and voilà, I started seeing the different number of threads running in the DO CONCURRENT loop (using top), according to what I set with omp_set_num_threads(). Performance also scaled almost perfectly with the number of threads, as expected.

 

So while I agree that the Fortran standard does not require DO CONCURRENT to run in parallel, it seems that the compiler does actually generate code for it here, just that OMP_NUM_THREADS=1 in the environment kept it single-threaded, regardless of what I set with omp_set_num_threads() from within the program itself.

 

So far I have only tested with 23.0.0, but have now access to 23.2.1. I will re-test with that version soon to see if anything might have changed.

 

jimdempseyatthecove
Honored Contributor III
2,780 Views

>> OMP_NUM_THREADS was set to 1

 

You did not mention that in your original post. Setting the OMP_NUM_THREADS environment variable sets the OMP_MAX_THREADS value. Thus, it places an upper limit on the omp_set_num_threads(nn) value. Note that the upper value for omp_set_num_threads(nn) may also depend on whether it is executed within a parallel region and on whether nested parallelism is enabled.

 

Jim Dempsey

Theurich
Novice
2,778 Views

However, even with OMP_NUM_THREADS=1 in the environment:

  • IFORT accepts omp_set_num_threads(nn) with nn>1 fine for OpenMP loops.
  • IFORT accepts omp_set_num_threads(nn) with nn>1 fine for DO-CONCURRENT loops.
  • IFX accepts omp_set_num_threads(nn) with nn>1 fine for OpenMP loops.

Only IFX with DO-CONCURRENT loops behaves differently. Again, this is with 2023.0.0, and I am planning on testing with 2023.2.1 as soon as I can, but I will have to wait until early next week for it.

But are you saying that the three cases listed above are actually violating the OMP_MAX_THREADS value set implicitly when OMP_NUM_THREADS is found in the environment?
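One way to probe this question for a plain OpenMP region is to print omp_get_max_threads() before and after the API call and then record the team size inside a parallel region. The following is a minimal sketch; the program name and the requested value of 4 are illustrative. Note that omp_get_num_threads() returns 1 whenever it is called outside a parallel region, so the team size has to be sampled inside one.

```fortran
program probe_team_size
  use omp_lib
  implicit none
  integer, parameter :: requested = 4   ! illustrative value
  integer :: team

  print *, "max threads before call: ", omp_get_max_threads()
  call omp_set_num_threads(requested)
  print *, "max threads after call:  ", omp_get_max_threads()

  team = 0
  !$omp parallel
  !$omp single
  team = omp_get_num_threads()   ! only meaningful inside a parallel region
  !$omp end single
  !$omp end parallel

  print *, "threads in parallel region: ", team
end program probe_team_size
```

Build with OpenMP enabled (for example -qopenmp with ifx or ifort) and run it once with OMP_NUM_THREADS=1 and once with the variable unset. The runtime may still deliver fewer threads than requested, for instance when dynamic adjustment or a thread limit applies.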

Ron_Green
Moderator
2,794 Views

@Theurich Testing with 23.2.1 is a good test.  We froze code for 23.0 around early October 2022.  Since then and up to code freeze for 2023.2.x we've put in 571 fixes.  In particular, DO CONCURRENT had many changes for functionality, bugs, AND performance leading up to 2023.2.0.  This includes the F2023 REDUCE clause which could be important for you in the future.  Also, the locality-spec features, DEFAULT, LOCAL, LOCAL_INIT, SHARED, received a lot of attention in the first half of 2023, after 2023.0 was released.  In short, a lot of work on DO CONCURRENT in 2023.  Now, will some of this affect your code?  Hard to say.  But I can say that if I had code with DO CONCURRENT, I would upgrade immediately.

 

Let us know what you find.  And if you find an issue, a code example could help us fix anything sub-par.
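For readers unfamiliar with the locality specifiers mentioned above, here is a minimal sketch of LOCAL and SHARED on a DO CONCURRENT loop. The program, array names, and size are made up for illustration, and compiler support for these Fortran 2018 specifiers varies by version.

```fortran
program dc_locality
  implicit none
  integer, parameter :: n = 8   ! illustrative size
  integer :: i
  real :: a(n), b(n), tmp

  call random_number(b)

  ! tmp is a scratch variable private to each iteration;
  ! a and b are shared across all iterations.
  do concurrent (i = 1:n) local(tmp) shared(a, b)
    tmp = b(i) * b(i)
    a(i) = tmp + 1.0
  end do

  ! Self-check with a small tolerance.
  if (any(abs(a - (b*b + 1.0)) > 1.0e-6)) error stop "unexpected result"
  print *, "ok, a(1) = ", a(1)
end program dc_locality
```

Without LOCAL(tmp), every iteration would assign to the same tmp, which is the kind of dependence the locality specifiers are meant to make explicit. The F2023 reduction feature uses a specifier of the form reduce(op: var) and is not shown here.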

 

Theurich
Novice
2,698 Views

@Ron_Green I finally have access to 2023.2.1 (specifically: ifx (IFX) 2023.2.0 20230721), and I tested again with DO CONCURRENT. Still the same behavior: as long as `OMP_NUM_THREADS` is set in the environment, any change from within the program via `omp_set_num_threads(nn)` is ignored. However, as soon as I unset `OMP_NUM_THREADS`, all works as expected.

Not a big deal to unset `OMP_NUM_THREADS`, but it is different from the OpenMP behavior, where the API call `omp_set_num_threads(nn)` from within the program takes priority over what comes from the environment.

Theurich
Novice
2,489 Views

It would be nice to get confirmation that the current behavior with IFX + OMP_NUM_THREADS + DO-CONCURRENT is not the intended behavior, and that we can expect future versions of IFX to eventually move toward the more consistent behavior as discussed.

I.e. calling omp_set_num_threads(nn) from inside the code will take priority over environment variable OMP_NUM_THREADS. As it is for all of the other cases as outlined above:

  • IFORT accepts omp_set_num_threads(nn) with nn>1 fine for OpenMP loops.
  • IFORT accepts omp_set_num_threads(nn) with nn>1 fine for DO-CONCURRENT loops.
  • IFX accepts omp_set_num_threads(nn) with nn>1 fine for OpenMP loops.

Thanks,
-Gerhard

Barbara_P_Intel
Employee
2,468 Views

Sorry for the delay in investigating this. @Theurich, can you please share your test?

 

JohnNichols
Valued Contributor III
2,452 Views

@Barbara_P_Intel , your comment reminded me of Tom Hanks and Wilson.  The package finally arrived.  

 

Thanks for the smile, it is better than Fortraning. 

Is Fortraning a real word?

Should it be Fortranning?

or FORTRANing?

Theurich
Novice
2,441 Views

Consider the following code:

program demoOmpNumSet

  use omp_lib

  implicit none

  integer, parameter  :: omp_num_threads=16
  integer, parameter  :: size=10000, repeater=100
  integer             :: rep, i, j
  real, allocatable   :: a(:,:), b(:,:)
  double precision    :: t0, ti, t1, t2

  allocate(a(size,size), b(size,size))

  print *, "omp_get_num_threads: ", omp_get_num_threads()
  print *, "omp_get_max_threads: ", omp_get_max_threads()
  print *
  print *, "omp_set_num_threads: ", omp_num_threads
  call omp_set_num_threads(omp_num_threads)
  print *
  print *, "omp_get_num_threads: ", omp_get_num_threads()
  print *, "omp_get_max_threads: ", omp_get_max_threads()

  t0 = omp_get_wtime()

  call random_number(b)

  ti = omp_get_wtime()

  do rep=1, repeater
    !$omp parallel do
    do j=1, size
    do i=1, size
      a(i,j) = b(i,j) * b(i,j)
    enddo
    enddo
    !$omp end parallel do
  enddo

  t1 = omp_get_wtime()

  do rep=1, repeater
    do concurrent (j=1:size)
    do i=1, size
      a(i,j) = b(i,j) * b(i,j)
    enddo
    enddo
  enddo

  t2 = omp_get_wtime()

  print *, "Time to initialize: ", ti-t0
  print *, "Time OpenMP loop:   ", t1-ti
  print *, "Time DO-CONCURRNET: ", t2-t1

end program

I vary the omp_num_threads parameter between tests, and I either adjust the OMP_NUM_THREADS environment variable or unset it completely. I then observe how long the loops take to execute, and use top to watch how many threads are active on the system while the code runs.

With IFORT (2021.8.0 20221119) both OMP and DO-CONCURRENT behave as expected, i.e. the value set by omp_set_num_threads() within the code determines the number of threads used by either approach, regardless of whether or how the OMP_NUM_THREADS environment variable is set:

 

OMP_NUM_THREADS=1, omp_num_threads=1:

omp_get_num_threads: 1
omp_get_max_threads: 1

omp_set_num_threads: 1

omp_get_num_threads: 1
omp_get_max_threads: 1
Time to initialize: 0.276350975036621
Time OpenMP loop: 3.36269402503967
Time DO-CONCURRNET: 3.35191202163696

 

OMP_NUM_THREADS=1, omp_num_threads=2:

omp_get_num_threads: 1
omp_get_max_threads: 1

omp_set_num_threads: 2

omp_get_num_threads: 1
omp_get_max_threads: 2
Time to initialize: 0.276911020278931
Time OpenMP loop: 2.78357601165771
Time DO-CONCURRNET: 2.87240791320801

 

OMP_NUM_THREADS=1, omp_num_threads=4:

omp_get_num_threads: 1
omp_get_max_threads: 1

omp_set_num_threads: 4

omp_get_num_threads: 1
omp_get_max_threads: 4
Time to initialize: 0.276546955108643
Time OpenMP loop: 2.19221901893616
Time DO-CONCURRNET: 2.16165018081665

 

unset OMP_NUM_THREADS, omp_num_threads=4:

omp_get_num_threads: 1
omp_get_max_threads: 256

omp_set_num_threads: 4

omp_get_num_threads: 1
omp_get_max_threads: 4
Time to initialize: 0.313963890075684
Time OpenMP loop: 2.22668409347534
Time DO-CONCURRNET: 2.30029201507568


However, with IFX (2023.0.0 20221201) things change! OMP still works as before, but the number of threads used by DO-CONCURRENT seems to be fixed by the environment. If the OMP_NUM_THREADS environment variable is set, it takes the value from there; if unset, it defaults to 256 on the compute nodes I am working on.

Notice also that the initialization time goes up, but I am not concerned about that here.

 

OMP_NUM_THREADS=1, omp_num_threads=1:

omp_get_num_threads: 1
omp_get_max_threads: 1

omp_set_num_threads: 1

omp_get_num_threads: 1
omp_get_max_threads: 1
Time to initialize: 1.28865694999695
Time OpenMP loop: 3.31528496742249
Time DO-CONCURRNET: 3.30231809616089

 

OMP_NUM_THREADS=1, omp_num_threads=2:

omp_get_num_threads: 1
omp_get_max_threads: 1

omp_set_num_threads: 2

omp_get_num_threads: 1
omp_get_max_threads: 2
Time to initialize: 1.28990292549133
Time OpenMP loop: 2.97741007804871
Time DO-CONCURRNET: 3.40985584259033

 

OMP_NUM_THREADS=1, omp_num_threads=4:

omp_get_num_threads: 1
omp_get_max_threads: 1

omp_set_num_threads: 4

omp_get_num_threads: 1
omp_get_max_threads: 4
Time to initialize: 1.28875398635864
Time OpenMP loop: 1.80025887489319
Time DO-CONCURRNET: 3.32655405998230

 

unset OMP_NUM_THREADS, omp_num_threads=4:

omp_get_num_threads: 1
omp_get_max_threads: 256

omp_set_num_threads: 4

omp_get_num_threads: 1
omp_get_max_threads: 4
Time to initialize: 1.28911781311035
Time OpenMP loop: 2.22033119201660
Time DO-CONCURRNET: 1.81347393989563

 

Obviously this is a very artificial example, and I am not really concerned about specific performance here. My main concern is to understand how we can set the number of threads available to DO-CONCURRENT from inside the executable. With IFORT this worked fine based on omp_set_num_threads(), but with IFX this mechanism no longer seems to work. Thanks.

 

Barbara_P_Intel
Employee
2,306 Views

Thank you for sharing your reproducer. I started looking at a matmul example and see a similar performance difference with ifx between OMP and DO CONCURRENT.

The Fortran developers like to see multiple reproducers to test their fixes.

 

Barbara_P_Intel
Employee
2,278 Views

I filed a bug report for this issue, CMPLRLLVM-52450. I will keep you posted on its progress toward a fix.



Theurich
Novice
2,260 Views

Awesome, and thanks for letting me know. I am looking forward to seeing how this progresses. Thanks!

-Gerhard

Barbara_P_Intel
Employee
2,146 Views

I learned something new about DO CONCURRENT and -qopenmp. For CPU the DO CONCURRENT is translated to OMP SIMD directives. So omp_set_num_threads() has no impact!

The parallel optimization team is working to implement DO CONCURRENT with OMP PARALLEL DO in a future release.

I'll keep you posted on the progress.
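Until that change lands, one way to get threading that is known to honor omp_set_num_threads() is to spell out the OpenMP directives explicitly rather than rely on DO CONCURRENT. The following is a minimal sketch under that assumption, not a recommendation from the compiler team; the program name, array size, and thread count are illustrative.

```fortran
program explicit_threads
  use omp_lib
  implicit none
  integer, parameter :: n = 512        ! illustrative, much smaller than the reproducer
  integer, parameter :: nthreads = 4   ! illustrative
  integer :: i, j
  real, allocatable :: a(:,:), b(:,:)

  allocate(a(n,n), b(n,n))
  call random_number(b)
  call omp_set_num_threads(nthreads)

  ! Thread the outer loop explicitly and ask for vectorization of the inner one.
  !$omp parallel do private(i)
  do j = 1, n
    !$omp simd
    do i = 1, n
      a(i,j) = b(i,j) * b(i,j)
    end do
  end do
  !$omp end parallel do

  ! Self-check: one multiply per element, so the comparison is exact.
  if (any(a /= b*b)) error stop "unexpected result"
  print *, "done; max threads = ", omp_get_max_threads()
end program explicit_threads
```

This follows the "parallelize the outermost loop, vectorize the innermost loop" strategy mentioned earlier in the thread, at the cost of tying the code to OpenMP instead of standard Fortran alone.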


