Intel® Fortran Compiler

"Nested" thread allocation through OpenMP?

Ioannis_K_
New Contributor I

Hello, I have a potentially advanced question regarding multi-threading and OpenMP.

 

I have a system with 128 threads, and I need to run a program that involves the following hierarchy:

 

!========================================================
Call SUB1
Call SUB2
Call SUB3
Call SUB4
!========================================================

Subroutine SUB1
CALL SUB1_1
CALL SUB1_2
...
CALL SUB1_10
end

Subroutine SUB2
CALL SUB2_1
CALL SUB2_2
...
CALL SUB2_10
end

Subroutine SUB3
CALL SUB3_1
CALL SUB3_2
...
CALL SUB3_10
end

Subroutine SUB4
CALL SUB4_1
CALL SUB4_2
...
CALL SUB4_10
end

!========================================================

 

Now, each of the subroutines SUBi_j, with i = 1,...,4 and j = 1,...,10, has the form:


SUBROUTINE SUBi_j

!$OMP PARALLEL DO
......
!$OMP END PARALLEL DO

end

!==============================================

The subroutines are independent of each other (in terms of data).

 

Right now, my program runs the 4 outer subroutines SUBi, i = 1,...,4, sequentially; inside each SUBi, it calls the SUBi_j subroutines sequentially, using OpenMP to parallelize the DO loops in the SUBi_j routines.

My question is as follows: is there a way to use OpenMP so that the outer-level subroutines SUB1, SUB2, SUB3 and SUB4 execute concurrently, with each one assigned 1/4 of my total threads (that is, 32 threads per subroutine), and each then using its subset of threads for the parallel DO loops in the inner-level subroutines SUBi_j?

I would imagine this can be done by combining MPI (for the outer-level thread subdivision) and OpenMP, but I was wondering whether it can be done using OpenMP alone.

jimdempseyatthecove
Honored Contributor III

Set the environment variable:

OMP_MAX_ACTIVE_LEVELS=2

Do not set an environment variable for the number of threads.
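Equivalently, the limit can be set at run time, before the first parallel region (a minimal sketch):

program main
use omp_lib
implicit none
call omp_set_max_active_levels(2)   ! same effect as OMP_MAX_ACTIVE_LEVELS=2
! ... parallel regions follow ...
end program main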

 

For your 1st level:

!$omp parallel num_threads(4)
!$omp sections
!$omp section
call sub1(...)
!$omp section
call sub2(...)
!$omp section
call sub3(...)
!$omp section
call sub4(...)
!$omp end sections
!$omp end parallel

 

Caution: do not run any other 1st-level parallel region with more than 4 threads; doing so may hurt performance.

 

For each level-2 parallel region, add the num_threads(32) clause to the PARALLEL directive.

 

Note, to be a little more flexible, call omp_get_max_threads() prior to the 1st parallel region, then set the thread limits for each level based on experiments you perform. For example, define module variables nThreadsLevel1 and nThreadsLevel2.
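A minimal sketch of such a module (the module name, the init routine, and the 4 x total/4 split are assumptions for illustration):

module thread_levels
   use omp_lib
   implicit none
   integer :: nThreadsLevel1, nThreadsLevel2
contains
   subroutine init_thread_levels()
      integer :: total
      total = omp_get_max_threads()   ! e.g. 128 on the system described above
      nThreadsLevel1 = 4              ! one thread per outer SECTION
      nThreadsLevel2 = max(total / nThreadsLevel1, 1)   ! e.g. 32
   end subroutine init_thread_levels
end module thread_levels

Call init_thread_levels() once at startup, then use num_threads(nThreadsLevel1) on the outer region and num_threads(nThreadsLevel2) on the inner ones.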

 

I would expect that the total work performed by sub1 differs from that of sub2, sub3 and sub4; that the work of sub2 differs from that of sub3 and sub4; and that the work of sub3 differs from that of sub4. If so, then partitioning like this might not be optimal (but at least it is a start).

 

Jim Dempsey

 

 

Ioannis_K_
New Contributor I

Thanks Jim.

 

This is exactly what I needed, and I agree that the best strategy is to allow some flexibility in terms of numbers of threads.

 

One follow-up theoretical question that I have (which is of no significance for my own code):

 

Would it be possible to also pursue a similar approach with 3 levels of parallelism? For instance, set OMP_MAX_ACTIVE_LEVELS=3, then define 3 module variables, nthread1, nthread2 and nthread3, such that nthread1*nthread2*nthread3 = OMP_NUM_THREADS.

(Note: I do not set the number of threads, so I assume that the default value of OMP_NUM_THREADS is the maximum number of threads available on my system.)

Then we continue with the same type of parallelism as the one you provided for level 1 (with num_threads(nthread1)),

and then have, at the second level (say, inside SUB1):

 

 

!$omp parallel num_threads(nthread2)
!$omp sections
!$omp section
call sub1_1(...)
!$omp section
call sub1_2(...)
!$omp section
call sub1_3(...)
!$omp section
call sub1_4(...)
.........................
!$omp section
call sub1_10(...)
!$omp end sections
!$omp end parallel

 

 

And then, at level 3 (the innermost level), I keep the PARALLEL DO loops inside the subroutines SUBi_j, but set num_threads(nthread3).
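Something like this sketch (the subroutine and loop body are placeholders of mine):

subroutine sub1_1
use mo1          ! assumed module holding nthread3
use omp_lib
implicit none
integer :: i
!$omp parallel do num_threads(nthread3)
do i = 1, 100
   ! ... innermost work ...
end do
!$omp end parallel do
end subroutine sub1_1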

 

Would this approach work?

 

Again, thank you very much for all your help and advice.

 

jimdempseyatthecove
Honored Contributor III

Yes; also remember to set OMP_MAX_ACTIVE_LEVELS=3.

 

Before you do this, use VTune to profile the system.

Too fine a partitioning will have a negative impact on performance.

You will need to determine whether there is sufficient work at the next level(s) to amortize the entry/exit overhead of that/those parallel regions.

 

Another option you have is to use the OpenMP TASK construct ***

*** but you must be careful not to oversubscribe the threads.

 

!$omp parallel num_threads(nthread2)
!$omp master
!$omp task
call sub1_1(...)
!$omp end task
!$omp task
call sub1_2(...)
!$omp end task
!$omp task
call sub1_3(...)
!$omp end task
!$omp task
call sub1_4(...)
!$omp end task
!$omp end master
!$omp end parallel

You can use non-task OMP constructs as well as task constructs within a task/non-task region ***

*** however, keep in mind that every thread issuing a new task-level or nest-level construct creates a new team for that thread at that level.

IOW, if your tree isn't balanced, and the nest-level threads on the second iteration are not the same as on the first iteration, this may expand the number of team-member threads for that thread team. This could lead to resource overhead and/or unnecessary system overhead in an application that oversubscribes threads.

Ioannis_K_
New Contributor I

Thanks Jim. 

 

I tried to create a simple example program to experiment with nested parallelism. I am attaching the source file for this program. The program calls 4 subroutines, and each subroutine has a PARALLEL DO loop. I have a module with several variables, including nthrea1 and nthrea2 (the outer and inner numbers of threads, respectively).

 

I use the SECTIONS construct so that each of the 4 subroutines is called by a different thread (so I have set nthrea1 = 4). The inner number of threads is 2 (nthrea2 = 2), so each PARALLEL DO loop uses 2 threads.

 

I noticed two issues and wanted to ask whether they can somehow be resolved:

 

1) The omp_get_thread_num() function always returns 0 when it is called from the inner-level parallel DO loops.

 

2) I have also added an ATOMIC construct in one of the inner-level subroutines (for a variable which is only accessed by that subroutine). I noticed that this ATOMIC construct is IGNORED. The variable isum2 is initialized to 0 and should have a value of 100 at the end of the program (as I have a PARALLEL DO loop with 100 iterations in subroutine SUB2, and this loop includes the line "isum2 = isum2 + 1").

I tried using a CRITICAL construct instead, and it is ignored once again. Is there a way to fix this?

 

Ioannis_K_
New Contributor I

OK, I think I just found my mistake:

 

I should not be wrapping OMP PARALLEL directives around the OMP PARALLEL DO loops. The code I had created 2 threads, and each thread executed the full DO loop with 100 iterations. What I want instead is a single DO loop whose iterations are shared between 2 threads.

So, the incorrect code for SUB2 that I had in my previously attached source file should be corrected to:

!================================================================================

subroutine sub2
use mo1
use omp_lib
implicit none

integer i1,ithr1
isum2 = 0

!$OMP PARALLEL DO num_threads(nthrea2),PRIVATE(ithr1)
do i1 = 1,100
   ithr1 = 1+omp_get_thread_num ( )
   write(*,*) 'sub2',i1,ithr1
   isum1 = isum1 + 1
!$OMP ATOMIC
   isum2 = isum2 + 1
!$OMP END ATOMIC
end do !i1
!$OMP END PARALLEL DO

end subroutine sub2

!==============================================================================

 

Similar corrections must be made to the other 3 subroutines SUB1, SUB3 and SUB4 in my previously attached file.

jimdempseyatthecove
Honored Contributor III

>>1) The omp_get_thread_num ( ) function always returns 0 when it is called from the inner-level parallel do loops

 

omp_get_thread_num() returns the team-member number, i.e. the thread number within the current team (numbered 0:nThreads-1).
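If you want to see both levels at once, you can also query the ancestor team (a minimal sketch; the subroutine name is mine):

subroutine report_ids
use omp_lib
implicit none
integer outer_id, inner_id
outer_id = omp_get_ancestor_thread_num(1)   ! member number within the level-1 team
inner_id = omp_get_thread_num()             ! member number within the current (inner) team
write(*,*) 'outer', outer_id, 'inner', inner_id
end subroutine report_ids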

 

2) I ran the test, and this is what I found:

a) You need to add

    call OMP_SET_NESTED(.true.)

before the call to CALL OMP_SET_MAX_ACTIVE_LEVELS (2)

b) Remove the "PARALLEL" on the DO lines.

 

      !$omp parallel num_threads(nthrea2)
      !$OMP DO PRIVATE(ithr1)
      do i1 = 1,100
          ithr1 = 1+omp_get_thread_num ( )
          write(*,*) 'sub1',i1,ithr1  
          isum1 = isum1 + 1 ! *** not thread-safe
      end do !i1
      !$omp end parallel

      !$omp parallel num_threads(nthrea2)
      !$OMP DO PRIVATE(ithr1)
      do i1 = 1,100
          ithr1 = 1+omp_get_thread_num ( )
          write(*,*) 'sub1',i1,ithr1  
          !$omp atomic
          isum1 = isum1 + 1 ! *** thread-safe
          !$omp end atomic
      end do !i1
      !$omp end parallel

 

Keeping the PARALLEL would try to add a 3rd level, which you denied with CALL OMP_SET_MAX_ACTIVE_LEVELS (2).

Note: with the write included and a loop count of 100, the likelihood of adverse interaction is low. IOW, the fact that you may not see a collision in isum1 = isum1 + 1 is not conclusive evidence that there is no potential for collision. I suggest you comment out the write(s) and increase the loop counts to 10000 (and also print out the value of isum1 at the end of the program).
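Something along these lines (my own minimal self-contained sketch, not your attached file; two outer sections with two inner threads each):

program nested_test
use omp_lib
implicit none
integer :: isum1, isum2
isum1 = 0; isum2 = 0
call omp_set_max_active_levels(2)   ! older runtimes may also need: call omp_set_nested(.true.)
!$omp parallel num_threads(2)
!$omp sections
!$omp section
call work(isum1)
!$omp section
call work(isum2)
!$omp end sections
!$omp end parallel
write(*,*) 'isum1 =', isum1, ' isum2 =', isum2   ! expect 10000 and 10000
contains
subroutine work(isum)
integer, intent(inout) :: isum
integer :: i
!$omp parallel do num_threads(2)
do i = 1, 10000
!$omp atomic
   isum = isum + 1
!$omp end atomic
end do
!$omp end parallel do
end subroutine work
end program nested_test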

 

Jim Dempsey

 

 

TobiasK
Moderator

Hi @Ioannis_K_ , @jimdempseyatthecove 

 

A side note on OMP sections and OMP tasks:

It is not guaranteed which thread picks up which section or task. If you want control over this, you have to do it manually, e.g.:

 

my_id=0
!$OMP parallel num_threads(nthread1) private(my_id)
my_id=omp_get_thread_num()

if(my_id.eq.0.or.nthread1.eq.1) call sub1
if(my_id.eq.1.or.nthread1.eq.1) call sub2
...
!$omp end parallel

 

Also, you might want to take NUMA effects into consideration. In such situations, proc_bind(spread) on the outermost parallel region is a handy way to place the threads in different NUMA regions; in the nested parallel regions you then choose proc_bind(close).

(If your application is heavily compute-bound with little memory access, you may ignore NUMA and just go on with sections/tasks.)

 

!$omp parallel num_threads(nthreads1) proc_bind(spread)
...
if(my_id.eq.1)then
!$omp parallel do num_threads(nthreads2) proc_bind(close)
do i=1,100
...
!$omp end parallel do
endif
...

 

That way you should get the same pinning and control as if you used MPI for the outer parallel regions.

 

If you know that the workload differs between the subroutines, you may also consider dropping nested OpenMP and instead doing the work sharing manually, letting the subroutines be called with different numbers of threads, more or less exactly how you would distribute work in an MPI program.

 

my_id=0
!$OMP parallel num_threads(nthread1) private(my_id)
my_id=omp_get_thread_num()

if((my_id.ge.0.and.my_id.le.3).or.nthread1.eq.1) call sub1(my_id)
if((my_id.ge.4.and.my_id.le.5).or.nthread1.eq.1) call sub2(my_id-4)   ! pass a zero-based id to sub2
...
!$omp end parallel

...
subroutine sub1(my_id)

! split the iterations of "do i=1,100" across the nsub threads that call sub1
nsub=4                  ! threads 0..3 call sub1 above
chunk_size=100/nsub
my_start=my_id*chunk_size+1
my_end=(my_id+1)*chunk_size
if(mod(100,nsub).ne.0)then
   my_start=my_start+min(my_id,mod(100,nsub))
   my_end=min(my_end+min(my_id+1,mod(100,nsub)),100)
end if
my_size=my_end-my_start+1
do i=my_start,my_end
...

However, note that you cannot use !$omp barrier inside the subroutines if not all threads call the subroutine, as a barrier must be encountered by all threads of the team. The same holds for reductions.

 

jimdempseyatthecove
Honored Contributor III

A potential problem with using my_id is that you are not assured the parallel region will run with the requested number of threads.

With !$OMP SECTIONS, on the other hand, you are assured that all sections execute regardless of the number of threads (or the lack of a parallel region, or the lack of expansion of the !$omp directives).

 

Jim Dempsey

TobiasK
Moderator

Of course, that approach needs extra care. However, whatever you specify in num_threads() will be used.

(Scaling a not-so-compute-bound algorithm with OpenMP across 128 cores and various numbers of NUMA domains will need a lot of care anyway.)

Barbara_P_Intel
Employee

ATOMIC is expensive. Would a REDUCTION clause on the !$OMP PARALLEL DO work instead? The reduction is done once at the end of the loop.
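For the isum2 example above, that would look something like this (a sketch reusing the variable names from the earlier post):

!$OMP PARALLEL DO num_threads(nthrea2) reduction(+:isum2)
do i1 = 1,100
   isum2 = isum2 + 1   ! each thread accumulates a private copy, combined once at the end
end do
!$OMP END PARALLEL DO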

Ioannis_K_
New Contributor I

The actual code where I will apply nested parallelization has operations on shared 1D arrays, which are obtained by summing contributions from smaller arrays, and multiple threads may need to work on the same location of an array. Is it possible to use the REDUCTION clause with array variables? Also, would I expect much benefit from using REDUCTION instead of ATOMIC?

It may be important to mention that the arrays I am operating on are allocatable (they are allocated at runtime), defined in a module, and very large.

jimdempseyatthecove
Honored Contributor III

While you can use reduction on arrays, this can be problematic when the arrays are exceedingly large (IOW, stack issues).

It is usually better to devise a method (e.g., tiles) that eliminates conflicts (or constricts conflicts to the perimeters of the tiles).

 

Jim Dempsey

 

Ioannis_K_
New Contributor I

Tiling is not possible/beneficial for the code that I am running. If the main issue with reduction is the stack, can't I just use heap arrays and resolve it?

So, would the following sample code actually work without any issue?

Let us assume that I have properly allocated the (large) array A(:), and also the smaller arrays k(:,:).

 

!$OMP PARALLEL DO PRIVATE(ithr1,i2), reduction( + : A )
do ii = 1,1000

   ithr1 = 1 + omp_get_thread_num ( )

   !.................... additional code not shown here, computing the values for array k() - each thread ithr1 calculates the values k(:,ithr1)

   do i1 = 1,nk     ! nk is the 1st dimension of array k; i2 must be PRIVATE, i1 is private automatically as a sequential loop index
      i2 = IA(i1)
      A(i2) = A(i2) + k(i1,ithr1)
   end do !i1

end do !ii
!$OMP END PARALLEL DO

 

 

 

 

TobiasK
Moderator

The problem is that every thread gets a private copy of the array, which is placed on the stack.
Each thread works with its private copy, and only at the end of the reduction construct does the reduction across the team of threads happen.
If you want to use heap arrays, you have to do it manually and allocate an array for each thread (or a global array with an nthread dimension), as sketched below.
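A minimal sketch of that manual approach (the names Apart and accumulate_manual are placeholders of mine; A is assumed already allocated):

subroutine accumulate_manual(A, nt)
use omp_lib
implicit none
real, intent(inout) :: A(:)
integer, intent(in) :: nt          ! number of threads to use
real, allocatable :: Apart(:,:)    ! one heap-allocated slab per thread
integer :: t
allocate(Apart(size(A), nt))
Apart = 0.0
!$omp parallel num_threads(nt)
! ... each thread accumulates its contributions into
!     Apart(:, omp_get_thread_num()+1), so there are no write conflicts ...
!$omp end parallel
do t = 1, nt
   A = A + Apart(:, t)             ! serial combine, no atomics needed
end do
deallocate(Apart)
end subroutine accumulate_manual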

It very much depends on what you actually do whether you really need an atomic update or can avoid it, but as Barbara mentioned: avoid atomic, avoid critical, avoid single; those constructs can really badly impact performance.

Ioannis_K_
New Contributor I

I thought that it is possible to use heap arrays instead of the stack, correct? Why would I have to manually dimension the separate arrays for the heap? I would imagine that all I have to do to prevent potential problems with the stack is to use the compiler option

-heap-arrays 0

At any rate, our discussion now is about using REDUCTION for a large array. If I manually add an extra dimension to the array that is accessed by all threads, then I do not see any point in using a REDUCTION clause anyway; we will know that different threads access different memory locations, so no overlap will occur.

jimdempseyatthecove
Honored Contributor III

Barbara,

The variable being incremented may be incremented by different parallel regions concurrently. Nested parallel regions are in effect, and different threads from the upper region may be in play.

What could be done is to use a temporary tally in the nested parallel loop via the REDUCTION clause, and then, after the parallel region, use !$OMP ATOMIC to accumulate the local tally into the global tally counter.
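A minimal sketch of that pattern (the variable name local_sum is mine):

integer :: local_sum
local_sum = 0
!$omp parallel do reduction(+:local_sum)
do i1 = 1, 100
   local_sum = local_sum + 1
end do
!$omp end parallel do
!$omp atomic
isum2 = isum2 + local_sum   ! one protected update of the global tally per inner region
!$omp end atomic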

Jim Dempsey

 

Ioannis_K_
New Contributor I

Jim, another (slightly unrelated) question that I had pertains to breaking lines in !$OMP directives:

If I have a fixed-form Fortran source file, can the compiler understand a multiple-line OpenMP directive?

I am asking because I may end up having !$OMP PARALLEL DO statements which simultaneously include NUM_THREADS, PRIVATE and REDUCTION clauses. I guess I can always increase the fixed-form line length if necessary, but I was curious whether the compiler can read a line break in !$OMP directives.

jimdempseyatthecove
Honored Contributor III

In fixed form you would use C$OMP... starting in column 1; a continuation line is C$OMP+... (any character in column 6 marks the continuation).

!$omp parallel do shared(a,b,c)

c$omp parallel do
c$omp+shared(a,b,c)

c$omp paralleldoshared(a,b,c)

Three different ways are shown above (remember that blanks are insignificant in fixed form).
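For free-form source, the continuation uses a trailing ampersand instead (a minimal sketch, reusing clause names from earlier in the thread):

!$omp parallel do num_threads(nthrea2) &
!$omp private(ithr1) reduction(+:isum2)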

 

Jim Dempsey
