Intel® Fortran Compiler

OpenMP: How to mix nested and non-nested parallelization?

Tue_B_
Novice

I have built this simple little program in an attempt to successfully switch between nested and non-nested parallelism. This program is designed for a NUMA system with at least 6 NUMA nodes, with 4 cores each, but if you have fewer NUMA nodes just change the number in the beginning...

So far I have been unable to figure out how to correctly combine nested and non-nested parallelism.

Just a quick explanation of what this program does:

It sets the affinity for nested parallelism.
Then it runs a nested parallel region (6 NUMA nodes, with 4 inner threads on the master's node and 3 on each of the others). (Here the load is distributed as desired across the NUMA nodes.)
Then it runs a non-nested parallel region with 6*4 threads scattered across the NUMA nodes. (Here the distribution also works as intended.)
Then it runs a second nested parallel region identical to the first. (Now the load distribution is completely wrong, and all the work ends up on 2 of the NUMA nodes while the rest do nothing.)
So does anyone know how to switch between these different types of parallel regions? Is what I'm seeing a bug or what exactly is happening here?

 

program NumaAwareDGEMM
 use IFPORT
 use omp_lib
 use mkl_service
 use mTEST
implicit none

 logical(4) :: Success
 integer :: NoNUMANodes, blocksize,nrepeats,Runmode,t0
 integer :: N,I,J,NIte, First,Last,k,colidx,error,numofblocks,iii,ii,dim,d,threadID,NumaID
 integer :: Iter,Solver,NUMASize,m,ThreadsPrNuma,ID, NCPU
 integer, allocatable,dimension(:) :: GlobalThreadID
 real*8,allocatable,dimension(:,:) :: A, B,C1,c2,c3,c4,c5,c6,c7,c8
 real*8,allocatable,dimension(:,:,:) :: C
 real*8,allocatable,dimension(:)    :: tmp
 logical, allocatable, dimension(:) :: NumaNodeDone,MKlbusy


 NoNUMANodes=6                     !How many NUMA nodes to distribute calculations over
 NCPU=6*4
success = SETENVQQ("OMP_DISPLAY_ENV=TRUE")
success=SETENVQQ("OMP_PLACES={0:6},{6:6},{12:6},{18:6},{24:6},{30:6},{36:6},{42:6}")
!success=SETENVQQ("OMP_PLACES={0:6},{6:6},{12:6},{18:6},{24:6},{30:6}")
!success=SETENVQQ("OMP_PLACES={0:8},{8:8},{16:8},{24:8},{32:8},{40:8}")

 blocksize=600
 dim=blocksize*NoNUMANodes
 allocate(A(dim,dim))
 allocate(B(dim,dim))
 allocate(C1(dim,dim))
 allocate(C2(dim,dim))
 allocate(C3(dim,dim))
 allocate(C4(dim,dim))
 allocate(C5(dim,dim))
 allocate(C6(dim,dim))
 allocate(C7(dim,dim))
 allocate(C8(dim,dim))
 allocate(tmp(NCPU))
 call KMP_SET_STACKSIZE_S(990000000)
 call omp_set_dynamic(0)
 call omp_set_nested(1)
   !intialization region
   call omp_set_num_threads(NoNUMANodes) !First we spawn all the threads in a threadpool
   !$OMP PARALLEL DEFAULT(SHARED) PRIVATE(i,ID)
   !$OMP DO SCHEDULE(STATIC)
   do i = 1,NoNUMANodes
      ID=omp_get_thread_num()
      print *,'Thread binding for socket=',ID
      if(i-1.ne.ID) print*,'ERROR on ID',ID,'i=',i
      SELECT CASE (i)
        CASE(1)
!          success=SETENVQQ("OMP_PLACES={0:8}")
          success=SETENVQQ("OMP_PLACES={0:6}")          
        CASE(2)
!          success=SETENVQQ("OMP_PLACES={8:8}")
          success=SETENVQQ("OMP_PLACES={6:6}")
        CASE(3)
!          success=SETENVQQ("OMP_PLACES={16:8}")
          success=SETENVQQ("OMP_PLACES={12:6}")
        CASE(4)
          success=SETENVQQ("OMP_PLACES={18:6}")
!          success=SETENVQQ("OMP_PLACES={24:8}")
        CASE(5)
          success=SETENVQQ("OMP_PLACES={24:6}")
!          success=SETENVQQ("OMP_PLACES={32:8}")
        CASE(6)
          success=SETENVQQ("OMP_PLACES={30:6}")
!          success=SETENVQQ("OMP_PLACES={40:8}")
        CASE(7)
          success=SETENVQQ("OMP_PLACES={36:6}")
!          success=SETENVQQ("OMP_PLACES={48:8}")
        CASE(8)
          success=SETENVQQ("OMP_PLACES={42:6}")
 !         success=SETENVQQ("OMP_PLACES={56:8}")
      END SELECT 
   end do
   !$OMP END DO
   !$OMP END PARALLEL  
    print*,'Initialization over'   
   ! 
    call omp_set_num_threads(NoNUMANodes) !Now outer parallelization over numa nodes
   !$OMP PARALLEL DEFAULT(SHARED) PRIVATE(i,ID)  
   !$OMP DO SCHEDULE(STATIC)
   do i = 1,NoNumanodes
      ID=omp_get_thread_num()       
      SELECT CASE (i)
        CASE(1)
          call Products(dim,A,B,C1)
        CASE(2)
          call Products(dim,A,B,C2)
        CASE(3)
          call Products(dim,A,B,C3)
        CASE(4)
          call Products(dim,A,B,C4)
        CASE(5)
          call Products(dim,A,B,C5)
        CASE(6)
          call Products(dim,A,B,C6)
        CASE(7)
          call Products(dim,A,B,C7)
        CASE(8)
          call Products(dim,A,B,C8)
      END SELECT 
   end do
   !$OMP END DO
   !$OMP END PARALLEL  
   print*,'First Nested done '

    print*,'Starting single parallel region'
   call omp_set_num_threads(NCPU)   
    print*,'Proc_bind',omp_get_proc_bind()
   !$OMP PARALLEL DEFAULT(SHARED) PRIVATE(i,j,k)  proc_bind(Spread)
   !$OMP DO SCHEDULE(STATIC)
   do i=1,NCPU
    k=0
    do j=1,1000000000
     k=k+exp((i*1d0))*exp(-(i*1d0))+(j**2)      
    end do
    tmp(i)=k
   end do
   !$OMP END DO
   !$OMP END PARALLEL  

   print*,'Single parallel region done'


    call omp_set_num_threads(NoNUMANodes) !Now outer parallelization over numa nodes   
   !$OMP PARALLEL DEFAULT(SHARED) PRIVATE(i) proc_bind(Spread)
   !$OMP DO SCHEDULE(STATIC)
   do i = 1,NoNumanodes
      SELECT CASE (i)
        CASE(1)
          call Products(dim,A,B,C1)
        CASE(2)
          call Products(dim,A,B,C2)
        CASE(3)
          call Products(dim,A,B,C3)
        CASE(4)
          call Products(dim,A,B,C4)
        CASE(5)
          call Products(dim,A,B,C5)
        CASE(6)
          call Products(dim,A,B,C6)
        CASE(7)
          call Products(dim,A,B,C7)
        CASE(8)
          call Products(dim,A,B,C8)
      END SELECT 
   end do
   !$OMP END DO
   !$OMP END PARALLEL  

    end program NumaAwareDGEMM

  module mTEST
   use omp_lib

    contains
    subroutine Products(n,A,B,C)
    implicit none
    real*8,dimension(:,:)  :: A,B,C
    integer :: n
    integer  :: i,j,k,ID

    ID=omp_get_thread_num()
    if (ID.eq.0) then
    call omp_set_num_threads(4) !Inner parallelization
    else 
    call omp_set_num_threads(3) !Inner parallelization    
    end if

   !$OMP PARALLEL DEFAULT(SHARED) PRIVATE(i) PROC_BIND(MASTER) 
   !$OMP DO SCHEDULE(STATIC)
    do i=1,n
     do j=1,n
      do k=1,n
        C(i,j)=A(i,j)*B(j,k)
      end do
     end do
    end do
   !$OMP END DO
   !$OMP END PARALLEL  

    end subroutine Products
end module mTEST

 

1 Solution
Andrey_C_Intel1
Employee

Hi Tue B.

Your problem looks like a known bug we've fixed recently, and the fix will be available with new compiler releases coming soon.

As a workaround, you can try setting the number of threads to a value smaller than NoNUMANodes when decreasing it from NCPU to NoNUMANodes, e.g. set it to 1 first:

   print*,'Starting second Nested region'
   call omp_set_num_threads(1)
   call omp_set_num_threads(NoNUMANodes)

The bug was that we didn't rebind threads when the team size was reduced. If you set it to 1 and then to NoNUMANodes, the team size will be increased, and the threads will be correctly bound to processors.

One more comment: you can set thread locations via OMP_PLACES=cores instead of listing explicit proc numbers; it is easier, more portable, and less error prone. Then OMP_PROC_BIND=spread will bind threads evenly across the available resources, and you will get good thread binding when the team size corresponds to the number of NUMA nodes (e.g. 1 thread per node, or N threads per node). By assigning the same place (e.g. {0:6}) to multiple threads you allow the OS to migrate threads freely among procs 0-5, and it may sometimes place several threads on the same proc, with poor performance of course. Giving each OpenMP thread a separate core (or hardware thread) avoids such problems.

Regards,
Andrey

View solution in original post

16 Replies
Steven_L_Intel1
Employee

Bumping this so that OpenMP experts might comment.

TimP
Honored Contributor III

Among the bugs is the awkward nesting of the matrix multiply.  Under OpenMP, compilers will not optimize the loop nesting even to the extent that may be possible outside an OpenMP region.  If you don't care to pay attention to this, you are much better off with MATMUL and the built-in opt-matmul parallel MKL support.  Even if you do pay attention, it will be difficult to approach MKL threaded performance.

MKL is designed to perform well with a single MATMUL spread over at least 2 numa nodes (if the matrix is big enough).   So it will be difficult to justify the trouble you have gone to.
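For instance (a sketch; the option is -qopt-matmul on Linux or /Qopt-matmul on Windows, and it is implied by -O3 together with -parallel):

   ! let the compiler substitute the threaded MKL matrix-multiply
   ! kernel for the intrinsic when opt-matmul is enabled
   C1 = MATMUL(A, B)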

The numbering in OMP_PLACES would be by logical threads rather than cores, unless you have disabled HyperThreading. I don't see how the numbering there corresponds with what little detail you gave about your target system.

Tue_B_
Novice

Tim Prince wrote:

Among the bugs is the awkward nesting of the matrix multiply.  Under OpenMP, compilers will not optimize the loop nesting even to the extent that may be possible outside an OpenMP region.  If you don't care to pay attention to this, you are much better off with MATMUL and the built-in opt-matmul parallel MKL support.  Even if you do pay attention, it will be difficult to approach MKL threaded performance.

MKL is designed to perform well with a single MATMUL spread over at least 2 numa nodes (if the matrix is big enough).   So it will be difficult to justify the trouble you have gone to.

The numbering in OMP_PLACES would be by logical threads rather than cores, unless you have disabled HyperThreading. I don't see how the numbering there corresponds with what little detail you gave about your target system.

Okay, there is quite a lot to this, so let me try to address each point by itself.

First, this is not about matrix multiplication; I merely use it as a simple showcase of the problem I'm dealing with.

Second, this topic is very specifically not about MKL, but purely about OpenMP. MKL inside OpenMP opens up a whole new set of problems, which I'm still awaiting a reply on in this thread (any insight, or poking the relevant people about that problem, would also be greatly appreciated):

https://software.intel.com/en-us/forums/topic/564569

Regarding the numbering in my OMP_PLACES, I can see how my initial post could be confusing, so let me try to clarify:

I currently have two NUMA systems available to me for testing. One system contains 8 NUMA nodes with 8 cores each (64 cores in total); the other contains 8 NUMA nodes with 6 cores each (48 cores in total).

The above code is currently set up for the 48-core system. (The commented-out OMP_PLACES lines are merely the settings I use when testing on the 64-core system.)

The reason I wrote that it was designed for a system with at least 6 NUMA nodes with 4 cores each was simply the number of threads the program was currently set to spawn; I'm sorry for being unclear about this in the initial comment.

I hope that clears up the details about the system I am running on and the problems I am encountering.

I really hope someone can help since we desperately need to be able to switch between these different regions of parallelization in the code we are developing.

jimdempseyatthecove
Honored Contributor III

In your nested region, j and k also need to be private.

I haven't experimented with PROC_BIND(MASTER); however, the term is fraught with ambiguity.

On !$OMP PARALLEL PROC_BIND(MASTER)

Which context does MASTER refer to??

a) the master thread prior to the statement
b) the current thread prior to the statement (if so, then why is the keyword MASTER used)
c) what will become the master thread of the new parallel region (same as b)

Your first parallel region is not doing what you think it is doing (or you haven't sketched the code properly)....
Threads do not own the environment. The process owns the environment (IOW, it is shared amongst the threads in the process).
Your first SETENVQQ (prior to the first parallel region) specifies that each thread of the to-be-constructed outermost parallel region is to occupy any 6 of the logical processors, in groups of 6 logical processors (this assumes HT disabled). Therefore, once the outer parallel region is established and each thread creates its nested region (first time) with !$OMP PARALLEL PROC_BIND(MASTER), and assuming b) above, there is no requirement to muck with SETENVQQ.

Also, why are you reducing the thread counts for the nested regions instantiated by the non-master outer region threads?

Jim Dempsey

TimP
Honored Contributor III

If you wish to be certain about shared/private status, use DEFAULT(NONE) and specify each variable as shared or private.  I think default(shared) doesn't affect the normal Fortran OpenMP rule that all the do indices default to private (which is different from C OpenMP). 
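For example, the Products region written that way (same variables as in the code above; a sketch, not a drop-in fix):

    ! DEFAULT(NONE) forces every variable to be listed explicitly, so a
    ! forgotten j or k becomes a compile-time error instead of a data race
    !$OMP PARALLEL DEFAULT(NONE) SHARED(n,A,B,C) PRIVATE(i,j,k)
    !$OMP DO SCHEDULE(STATIC)
     do i=1,n
      do j=1,n
       do k=1,n
         C(i,j)=A(i,j)*B(j,k)
       end do
      end do
     end do
    !$OMP END DO
    !$OMP END PARALLEL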

Intel clouded the picture by having default-shared cilk_for indices in early versions of Cilk(tm) Plus for C, without a compiler warning.  One might think that a compile-time warning for non-private indices would be possible in OpenMP as well as cilk_for.

Tue_B_
Novice

jimdempseyatthecove wrote:

In your nested region, j and k also need to be private.

I haven't experimented with PROC_BIND(MASTER); however, the term is fraught with ambiguity.

On !$OMP PARALLEL PROC_BIND(MASTER)

Which context does MASTER refer to??

a) the master thread prior to the statement
b) the current thread prior to the statement (if so, then why is the keyword MASTER used)
c) what will become the master thread of the new parallel region (same as b)

Your first parallel region is not doing what you think it is doing (or you haven't sketched the code properly)....
Threads do not own the environment. The process owns the environment (IOW, it is shared amongst the threads in the process).
Your first SETENVQQ (prior to the first parallel region) specifies that each thread of the to-be-constructed outermost parallel region is to occupy any 6 of the logical processors, in groups of 6 logical processors (this assumes HT disabled). Therefore, once the outer parallel region is established and each thread creates its nested region (first time) with !$OMP PARALLEL PROC_BIND(MASTER), and assuming b) above, there is no requirement to muck with SETENVQQ.

Also, why are you reducing the thread counts for the nested regions instantiated by the non-master outer region threads?

Jim Dempsey

In my nested region, j and k are not present, so I'm not sure why I would need to specify them. (I mean, they are local variables in the subroutine Products, which is called in the nested region, but surely that can't mean they need to be explicitly set to private in the parallelization?)

I just read up on proc_bind(master) and you are correct that it is not what I want; instead I should use proc_bind(close). As far as I can tell, the master thread is the original thread that spawned the program, which is obviously not what I want, whereas proc_bind(close) refers to the parent thread.
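In sketch form, the two policies I now care about (my reading of the spec, so treat it as an assumption):

   ! proc_bind(close)  - team threads are assigned to places close to
   !                     the parent thread's place
   ! proc_bind(spread) - team threads are spread evenly across the
   !                     parent's place partition
   !$OMP PARALLEL PROC_BIND(CLOSE)
     print *,'inner thread',omp_get_thread_num()
   !$OMP END PARALLEL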

The reason I do the nested SETENVQQ(OMP_PLACES) is the results we found in this thread:

https://software.intel.com/en-us/comment/1823147#comment-1823147

But I think I see your point. What you are saying is that OMP_PLACES is a global environment variable, which affects all parallel regions on all levels, right? (As opposed to the parallel-local variable I assumed it to be, which would only affect the next level of parallelization.)
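A quick sketch of what that means in practice (GETENVQQ is from IFPORT; MY_TEST_VAR is just a made-up name): a value set by one thread is immediately visible to all the others.

   character(len=64) :: buf
   integer :: lng
   !$OMP PARALLEL NUM_THREADS(2) PRIVATE(buf,lng)
     if (omp_get_thread_num() == 0) success = SETENVQQ("MY_TEST_VAR=hello")
     !$OMP BARRIER
     lng = GETENVQQ("MY_TEST_VAR", buf)  ! both threads read back "hello"
     print *, omp_get_thread_num(), buf(1:lng)
   !$OMP END PARALLEL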

Finally, the reason I'm reducing the number of threads on the non-master nodes was purely to see whether they behaved as expected.

Anyway, based on the insights and suggestions you just gave me, Jim, I will make a new test program immediately and report back when done.

Tue_B_
Novice

Alright, based on Jim's input, I made a new test case (and cleaned it up a bit, so it will hopefully be easier to understand). Unfortunately, the new test case shows the same load-imbalance problem as the original one. 

program NumaAwareDGEMM
 use IFPORT
 use omp_lib
 use mTEST
implicit none

 logical(4) :: Success
 integer :: i,j,k,blocksize,NCPU,NoNumanodes,dim,ID
 real*8,allocatable,dimension(:,:) :: A, B,C1,c2,c3,c4,c5,c6,c7,c8
 real*8,allocatable,dimension(:,:,:) :: C
 real*8,allocatable,dimension(:)   :: tmp   

 NoNUMANodes=6                     !How many NUMA nodes to distribute calculations over.
 NCPU=6*4                          !Number of CPUs to run non-nested parallel regions on.
 success = SETENVQQ("OMP_DISPLAY_ENV=TRUE")
success=SETENVQQ("OMP_PLACES={0:6},{6:6},{12:6},{18:6},{24:6},{30:6}")

 blocksize=600
 dim=blocksize*NoNUMANodes
 allocate(A(dim,dim))
 allocate(B(dim,dim))
 allocate(C1(dim,dim))
 allocate(C2(dim,dim))
 allocate(C3(dim,dim))
 allocate(C4(dim,dim))
 allocate(C5(dim,dim))
 allocate(C6(dim,dim))
 allocate(C7(dim,dim))
 allocate(C8(dim,dim))
 allocate(tmp(NCPU))
 call KMP_SET_STACKSIZE_S(990000000)
 call omp_set_dynamic(0)
 call omp_set_nested(1)
 
  
   print*,'Starting first Nested region'
   call omp_set_num_threads(NoNUMANodes) !Now outer parallelization over numa nodes
   !$OMP PARALLEL DEFAULT(SHARED) PRIVATE(i,ID)  
   !$OMP DO SCHEDULE(STATIC)
   do i = 1,NoNumanodes
      SELECT CASE (i)
        CASE(1)
          call Products(dim,A,B,C1)
        CASE(2)
          call Products(dim,A,B,C2)
        CASE(3)
          call Products(dim,A,B,C3)
        CASE(4)
          call Products(dim,A,B,C4)
        CASE(5)
          call Products(dim,A,B,C5)
        CASE(6)
          call Products(dim,A,B,C6)
        CASE(7)
          call Products(dim,A,B,C7)
        CASE(8)
          call Products(dim,A,B,C8)
      END SELECT 
   end do
   !$OMP END DO
   !$OMP END PARALLEL  
   print*,'First Nested done '
   
   print*,'Starting single parallel region'
   call omp_set_num_threads(NCPU)   
   !$OMP PARALLEL DEFAULT(SHARED) PRIVATE(i,j,k)  proc_bind(Spread)
   !$OMP DO SCHEDULE(STATIC)
   do i=1,NCPU
    k=0
    do j=1,1000000000
     k=k+exp((i*1d0))*exp(-(i*1d0))+(j**2)      
    end do
    tmp(i)=k
   end do
   !$OMP END DO
   !$OMP END PARALLEL     
   print*,'Single parallel region done'
   
  
   print*,'Starting second Nested region'
   call omp_set_num_threads(NoNUMANodes) !Now outer parallelization over numa nodes   
   !$OMP PARALLEL DEFAULT(SHARED) PRIVATE(i) proc_bind(Spread)
   !$OMP DO SCHEDULE(STATIC)
   do i = 1,NoNumanodes
      SELECT CASE (i)
        CASE(1)
          call Products(dim,A,B,C1)
        CASE(2)
          call Products(dim,A,B,C2)
        CASE(3)
          call Products(dim,A,B,C3)
        CASE(4)
          call Products(dim,A,B,C4)
        CASE(5)
          call Products(dim,A,B,C5)
        CASE(6)
          call Products(dim,A,B,C6)
        CASE(7)
          call Products(dim,A,B,C7)
        CASE(8)
          call Products(dim,A,B,C8)
      END SELECT 
   end do
   !$OMP END DO
   !$OMP END PARALLEL  
   print*,'Second nested region done'
    end program NumaAwareDGEMM
    
 module mTEST
   use omp_lib
    
    contains
    subroutine Products(n,A,B,C)
    implicit none
    real*8,dimension(:,:)  :: A,B,C
    integer :: n
    integer  :: i,j,k,ID

    ID=omp_get_thread_num()
    if (ID.eq.0) then !I do the following just to see that I am able to change the number of threads on just one NUMA node.
    call omp_set_num_threads(4) !Inner parallelization
    else 
    call omp_set_num_threads(3) !Inner parallelization    
    end if
    
   !$OMP PARALLEL DEFAULT(SHARED) PRIVATE(i) PROC_BIND(close) 
   !$OMP DO SCHEDULE(STATIC)
    do i=1,n
     do j=1,n
      do k=1,n
        C(i,j)=A(i,j)*B(j,k)
      end do
     end do
    end do
   !$OMP END DO
   !$OMP END PARALLEL  
    
    end subroutine Products
    
    
 end module mTEST

So, just to make it completely clear what kind of load-imbalance problem I'm having, I have attached screenshots of the above program running.

Environment.png - this is just the printout from OMP_DISPLAY_ENV=TRUE.

FirstNested.png - This shows how the program initially loads the NUMA nodes; as expected it runs on 6 NUMA nodes, with 4 threads on the master node and 3 threads on each of the other NUMA nodes.

SingleParallel.png - Here the program has just switched from the nested region to the single parallel region (the dip is the switch); as expected, this runs with 4 threads on each of the 6 active NUMA nodes.

SecondNested.png - Here we see the problem. When the second nested parallelization starts, the load balance is completely off. Even though the parallel region was called with proc_bind(spread), the threads somehow do not respect this and instead group up on a few NUMA nodes.

Andrey_C_Intel1
Employee

Hi Tue B.

Your problem looks like a known bug we've fixed recently, and the fix will be available with new compiler releases coming soon.

As a workaround, you can try setting the number of threads to a value smaller than NoNUMANodes when decreasing it from NCPU to NoNUMANodes, e.g. set it to 1 first:

   print*,'Starting second Nested region'
   call omp_set_num_threads(1)
   call omp_set_num_threads(NoNUMANodes)

The bug was that we didn't rebind threads when the team size was reduced. If you set it to 1 and then to NoNUMANodes, the team size will be increased, and the threads will be correctly bound to processors.

One more comment: you can set thread locations via OMP_PLACES=cores instead of listing explicit proc numbers; it is easier, more portable, and less error prone. Then OMP_PROC_BIND=spread will bind threads evenly across the available resources, and you will get good thread binding when the team size corresponds to the number of NUMA nodes (e.g. 1 thread per node, or N threads per node). By assigning the same place (e.g. {0:6}) to multiple threads you allow the OS to migrate threads freely among procs 0-5, and it may sometimes place several threads on the same proc, with poor performance of course. Giving each OpenMP thread a separate core (or hardware thread) avoids such problems.
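A minimal sketch of that setup (the spread,close pair is just an example of per-level binding, not something taken from your code):

   success = SETENVQQ("OMP_PLACES=cores")          ! one place per core
   ! OMP_PROC_BIND accepts one policy per nesting level:
   ! spread for the outer team over the nodes, close for the inner teams
   success = SETENVQQ("OMP_PROC_BIND=spread,close")
   call omp_set_nested(1)
   call omp_set_num_threads(NoNUMANodes)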

Regards,
Andrey

jimdempseyatthecove
Honored Contributor III

...
success=SETENVQQ("OMP_PLACES={0:6},{6:6},{12:6},{18:6},{24:6},{30:6}")
...
   print*,'Starting first Nested region'
   call omp_set_num_threads(NoNUMANodes) !Now outer parallelization over numa nodes
   !$OMP PARALLEL DEFAULT(SHARED) PRIVATE(i,ID) 
 

The above is entry to your first parallel region. This region should be set up properly according to your desired scheme.

 What is the observed behavior of this first (nested) parallel region?

===============
   print*,'Starting single parallel region'
   call omp_set_num_threads(NCPU)  
   !$OMP PARALLEL DEFAULT(SHARED) PRIVATE(i,j,k)  proc_bind(Spread)
 

The above entry into your second (outer) parallel region is going to bung up the thread team you established for your outer parallel region (it is changing the number of threads and thread placements).

I suggest you test commenting out this parallel region.

   print*,'Starting second Nested region'
   call omp_set_num_threads(NoNUMANodes) !Now outer parallelization over numa nodes  
   !$OMP PARALLEL DEFAULT(SHARED) PRIVATE(i) proc_bind(Spread)
 

The above is your 3rd parallel region. When you test with the second region commented out, also remove/comment "proc_bind(spread)". The proc placements for your outer-region thread team have already been established. proc_bind(spread) uses the partition of the "master" thread, which in this case is the main thread outside of all parallel regions... and after entry into and exit from the first parallel region, this thread may have a partition restricted to {0:6}. Therefore "proc_bind(spread)" may now refer to within {0:6}.

Running this variation (commenting out your former second region and removing "proc_bind(spread)" from your third, now second, region), what is the behavior of the application?

If this works as desired, then you may have to accept the fact that you cannot alter (at least upsize) the number of threads (and placement) for the outermost parallel region. This then means that, in order to utilize all threads, you would be required to structure nested parallel regions.

Jim Dempsey

jimdempseyatthecove
Honored Contributor III

Andrey,

The flexibility that Tue B is seeking is not an unreasonable request. At times it may be advantageous to subdivide work using nested regions across NUMA nodes, and at other times to use a single non-nested region across all logical processors. Considering this, it may be advantageous to implement an extension to OpenMP.

*** Hypothetical OpenMP extension **

   !$OMP PARALLEL TEAM(ALL) ! all logical processors subject to omp_set_num_threads
   !$OMP PARALLEL TEAM(PLACES={0:6},{6:6},{12:6},{18:6},{24:6},{30:6})
   !$OMP PARALLEL TEAM(CORES) ! all cores subject to omp_set_num_threads
   !$OMP PARALLEL TEAM(NUMA) ! all NUMA nodes subject to omp_set_num_threads

You may object to using the clause name TEAM, but note that it is singular and not the same as TEAMS (plural).

Of course these can be extended for additional flexibility.

Jim Dempsey

 

jimdempseyatthecove
Honored Contributor III

Andrey,

After I made my prior post, I gave my suggestion some additional thought. What I would like to suggest as an extension is to "borrow" the thread-scheduling annotations used by my C++ threading toolkit (somewhat defunct but still accessible on my website).

*** CAUTION The following is a hypothetical feature suggestion ***
This is a re-write of Tue B's test program using the hypothetical extension.
The purpose of which is to illustrate the ease and clarity of use:

program NumaAwareDGEMM
 use IFPORT
 use omp_lib
 use mTEST
implicit none

 logical(4) :: Success
 integer :: i,j,k,blocksize,dim,ID
 real*8,allocatable,dimension(:,:) :: A, B
 real*8,allocatable,dimension(:,:,:) :: C
 real*8,allocatable,dimension(:)   :: tmp   

 call KMP_SET_STACKSIZE_S(990000000)
 call omp_set_dynamic(0)
 call omp_set_nested(1)
 success = SETENVQQ("OMP_DISPLAY_ENV=TRUE")
 success=SETENVQQ("KMP_AFFINITY=compact")
 ! Collect some useful topology information
 !$OMP PARALLEL PRIVATE(Node) TEAM(OneEachM0) ! Create a thread team of one thread per NUMA node
   NoNUMANodes=omp_get_num_threads()          ! Save how many NUMA nodes we have.
   MyNode = omp_get_thread_num()              ! Save my NUMA node number (0-based) into TLS
   !$OMP PARALLEL TEAM(M0)                    ! Nested region, each thread creates team of all threads on its NUMA node
     MyThreadOnNode = omp_get_thread_num()    ! Save my 0-based thread number within my NUMA node
     ThreadsOnNode(MyNode) = omp_get_num_threads() ! Set number of threads available on this node into table
   !$OMP END PARALLEL
 !$OMP END PARALLEL
 NCPU = sum(ThreadsOnNode)

 blocksize=600
 dim=blocksize*NoNUMANodes
 allocate(A(dim,dim))
 allocate(B(dim,dim))
 allocate(C(dim,dim,0:NoNUMANodes-1))
 allocate(tmp(NCPU))
  
 print*,'Starting first Nested region'
 !$OMP PARALLEL TEAM(OneEachM0)
   call Products(dim,A,B,C(:,:,MyThreadOnNode))
 !$OMP END PARALLEL  
 print*,'First Nested done '
   
 print*,'Starting single parallel region'
 !$OMP PARALLEL DEFAULT(SHARED) PRIVATE(i,j,k)
 !$OMP DO SCHEDULE(STATIC)
 do i=1,NCPU
   k=0
   do j=1,1000000000
     k=k+exp((i*1d0))*exp(-(i*1d0))+(j**2)      
   end do
   tmp(i)=k
 end do
 !$OMP END DO
 !$OMP END PARALLEL     
 print*,'Single parallel region done'
   
  
 print*,'Starting second Nested region'
 !$OMP PARALLEL TEAM(OneEachM0)
   call Products(dim,A,B,C(:,:,MyThreadOnNode))
 !$OMP END PARALLEL  
 print*,'Second nested region done'
end program NumaAwareDGEMM
    
module mTEST
  use omp_lib
  integer :: NCPU,NoNumanodes
  integer :: ThreadsOnNode(0:255)   ! 0-based node numbering
  integer :: MyNode, MyThreadOnNode ! useful values
  !$OMP THREADPRIVATE(MyNode, MyThreadOnNode)
    
  contains
  subroutine Products(n,A,B,C)
    implicit none
    real*8,dimension(:,:)  :: A,B,C
    integer :: n
    integer  :: i,j,k,ID

    if (MyNode.eq.0) then
      !I do the following just to see that I am able to change the number of nodes on just one NUMAnode.
      call omp_set_num_threads(ThreadsOnNode(MyNode)-1) !Inner parallelization    
    end if
    
    !$OMP PARALLEL DEFAULT(SHARED) PRIVATE(i,j,k) TEAM(M0) ! Within my NUMA node 
      !$OMP DO SCHEDULE(STATIC)
      do i=1,n
       do j=1,n
        do k=1,n
          C(i,j)=A(i,j)*B(j,k)
        end do
       end do
      end do
     !$OMP END DO
   !$OMP END PARALLEL  
    
  end subroutine Products
    
    
end module mTEST

*** CAUTION the above code is hypothetical and will not compile nor run as intended ***

Jim Dempsey

jimdempseyatthecove
Honored Contributor III

By the way, the Products would likely have contained:

    !$OMP PARALLEL DEFAULT(SHARED) PRIVATE(i,j,k) TEAM(M0) ! Within my NUMA node 
      !$OMP DO SCHEDULE(STATIC)
      do i=1,n
       do j=1,n
        C(i,j)=0.0
        do k=1,n
          C(i,j)=C(i,j) + A(i,j)*B(j,k)
        end do
       end do
      end do
     !$OMP END DO
   !$OMP END PARALLEL  

But this was sketch code, not actual code.

In an actual application, the allocation size of C would not depend on the number of NUMA nodes; rather, the size would be fixed and partitioned by the number of NUMA nodes.
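A sketch of that partitioning (cols_per_node, first and last are hypothetical helpers; MyNode is from the hypothetical module above):

    integer :: cols_per_node, first, last
    cols_per_node = (dim + NoNUMANodes - 1) / NoNUMANodes  ! ceiling division
    first = MyNode*cols_per_node + 1
    last  = min(first + cols_per_node - 1, dim)
    ! each node then works on its own slice C(:, first:last) of one fixed-size C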

Jim Dempsey

Tue_B_
Novice

Andrey Churbanov (Intel) wrote:

Hi Tue B.

Your problem looks like a known bug we've fixed recently, and the fix will be available with new compiler releases coming soon.

As a workaround, you can try setting the number of threads to a value smaller than NoNUMANodes when decreasing it from NCPU to NoNUMANodes, e.g. set it to 1 first:

   print*,'Starting second Nested region'
   call omp_set_num_threads(1)
   call omp_set_num_threads(NoNUMANodes)

The bug was that we didn't rebind threads when the team size was reduced. If you set it to 1 and then to NoNUMANodes, the team size will be increased, and the threads will be correctly bound to processors.

Thanks Andrey, this worked perfectly. 

Andrey Churbanov (Intel) wrote:

One more comment: you can set thread locations via OMP_PLACES=cores instead of listing explicit proc numbers; it is easier, more portable, and less error prone. Then OMP_PROC_BIND=spread will bind threads evenly across the available resources, and you will get good thread binding when the team size corresponds to the number of NUMA nodes (e.g. 1 thread per node, or N threads per node). By assigning the same place (e.g. {0:6}) to multiple threads you allow the OS to migrate threads freely among procs 0-5, and it may sometimes place several threads on the same proc, with poor performance of course. Giving each OpenMP thread a separate core (or hardware thread) avoids such problems.

We will probably end up using OMP_PLACES=cores in the end; the reason I'm writing the places explicitly now is that I wanted to avoid migration across the hyperthreads, which I guess you can't avoid with OMP_PLACES=cores.
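If I read the abstract names right (an assumption on my part, not something I have tested yet), OMP_PLACES=threads would give exactly that guarantee:

 ! one place per hardware thread: a bound OpenMP thread cannot migrate at all
 success = SETENVQQ("OMP_PLACES=threads")
 ! one place per core: a thread keeps to its core but may still float
 ! between that core's hyperthreads
 !success = SETENVQQ("OMP_PLACES=cores")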
 
Jim, I completely agree that your hypothetical extension of OpenMP would be very welcome, though with Andrey's fix it is now possible to jump between NUMA and non-NUMA parallelization, which is the most important thing.
Andrey_C_Intel1
Employee

Jim,

Your idea of setting the number of threads based on keywords (and not only digits) sounds interesting. Having a way to query the machine topology may help people write more portable code. The current OpenMP specification has some means to get this information (number of processors and number of places), but apparently not enough. We will think about how to help people get NUMA-related information using OpenMP, and how to use this information conveniently. I cannot promise anything in particular, though.
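For reference, a minimal sketch of those queries (omp_get_num_places is from a newer OpenMP revision, so its availability in a given compiler is an assumption):

   use omp_lib
   print *,'logical processors:', omp_get_num_procs()
   print *,'places            :', omp_get_num_places()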

Thanks,
Andrey

jimdempseyatthecove
Honored Contributor III

Andrey,

Take some time (or assign someone to take some time) to download my documentation on www.quickthreadprogramming.com. Pay particular attention to how the annotations and support routines are structured to make thread teaming easy to use and self-explanatory.

There are several distinct features added to the QuickThread threading toolkit that could be taken and inserted into an Intel extension of OpenMP (the API can be published for others to implement). I am aware that this is not compliant with OpenMP standards. That said, standards do not generally move until there is a compelling reason to do so. Getting good feedback on feature enhancements is one way to provide the impetus for change.

One of the major differences between nested parallelism in QuickThread and in OpenMP is that QuickThread uses a single thread pool, as compared with OpenMP's nesting of pools, which as a consequence has the potential for oversubscription and unnecessary stack consumption. The single QuickThread thread pool has each thread servicing its own collection of queues: same thread, same core (same as L1), same L2 (usually equivalent to same core), same L3 (usually same as same socket), same NUMA node, one NUMA distance, two NUMA distances, three NUMA distances. Effectively there is a two-dimensional queuing system driven off an optional selector.

What may be beneficial to Intel is to invite me in as a consultant to work with your team to integrate into OpenMP the thread pool, queuing system, attribute selection and utility functions (e.g. team barrier). This then could be tested by some of your advanced users (such as Tue B) for comments and to improve integration into OpenMP. The integration should be done for both C++ and Fortran.

Jim Dempsey

John_Campbell
New Contributor II

You may improve your vectorization if you modify the Products loop order to:

    !$OMP PARALLEL DEFAULT(SHARED) PRIVATE(j,k) TEAM(M0) ! Within my NUMA node 
      !$OMP DO SCHEDULE(STATIC)
       do j=1,n
        C(:,j)=0.0
        do k=1,n
!      do i=1,n
          C(:,j)=C(:,j) + A(:,j)*B(j,k)
!      end do
        end do
       end do
     !$OMP END DO
   !$OMP END PARALLEL  

It may be "sketch code" but attention to localised calculations in the inner loop always helps with memory access bottlenecks associated with OpenMP.

(You could reduce it further to "C(:,j) = A(:,j) * sum(B(j,:))", but I don't think that was the purpose of the example.)

 

John
