bug in ZGETRF thread spawning?

Tue_B_ · ‎02-02-2016

A week ago I noticed that my code was running at about 50% of the expected speed. On closer inspection I found that ZGETRF spawns significantly more threads than intended when called inside a nested parallel region. I can't say for sure whether this is what costs me about half the performance I would expect, but something is definitely off.

Below I have pasted a small example that should illustrate the problem: it creates an outer parallel region, and inside this region you can call either Intel's DGEMM routine or Intel's ZGETRF routine. When calling DGEMM, the thread count of the program stays at the expected number, whereas when ZGETRF is called the thread count jumps to far higher numbers (I monitor the thread count with Process Explorer).

program NumaAwareDGEMM
 use IFPORT
 use omp_lib
 use mkl_service
implicit none

 integer :: i,j,k,NCPU,NoNumanodes,dim,success,id,NCPUinner,ij,ND,INFO1,INFO2
 integer,allocatable,dimension(:) :: IPS1,IPS2
 real*8,allocatable,dimension(:,:) :: A, B,C1,c2
 complex*16,allocatable,dimension(:,:) :: A1,A2
 real*8,allocatable,dimension(:,:,:) :: C
 real*8,allocatable,dimension(:)   :: tmp   
real*8 :: tmp_r
 NCPUinner=5
 NoNUMANodes=2                     !How many NUMA nodes to distribute calculations over.
 NCPU=10                          !Number of CPUs to run non-nested parallel regions on.
 success = SETENVQQ("OMP_DISPLAY_ENV=TRUE")
success=SETENVQQ("OMP_PLACES={0:10},{10:10}")

 ND=4500
 dim=ND*NoNUMANodes
 
 !Multiplication allocatables
 allocate(A(dim,dim))
 allocate(B(dim,dim))
 allocate(C1(dim,dim))
 allocate(C2(dim,dim))
 
 !Factorization allocatables
 allocate(A1(ND,ND))
 allocate(A2(ND,ND))
 allocate(IPS1(ND))
 allocate(IPS2(ND))
 
 call KMP_SET_STACKSIZE_S(990000000)
 call omp_set_dynamic(.false.)   !omp_set_dynamic takes a LOGICAL argument
 call mkl_set_dynamic(0)
 call omp_set_nested(.true.)     !omp_set_nested takes a LOGICAL argument
 call omp_set_num_threads(NCPU) 
 
    !OpenMP settings are applied when the first parallel loop is found. We do a dummy loop here to get it done now.
    !$OMP PARALLEL DEFAULT(PRIVATE) SHARED(NCpu) REDUCTION(+:J)
      J=0
    !$OMP DO
      do i=1,NCpu
        J=J+1      
      end do   
    !$OMP END DO
    !$OMP END PARALLEL

   call omp_set_num_threads(1)          !There is a bug in Intel's OpenMP implementation that requires the number of threads to be set to 1 before resetting the number of threads
   call omp_set_num_threads(NoNUMANodes) 
  
   !$OMP PARALLEL DEFAULT(SHARED) PRIVATE(i,ID,k,ij,j)  
   !$OMP DO SCHEDULE(STATIC)
   do i = 1,NoNumanodes*4
     k=mod(i,2)
     !Make the matrices for Factorization
     if(k.eq.0) then
     do ij=1,ND
      do j=1,ND
        A1(ij,j)=1d0/(1d0+abs(ij-j))
      end do
     end do
     else
     do ij=1,ND
      do j=1,ND
          A2(ij,j)=1d0/(1d0+abs(ij-j))
      end do
     end do
     end if
     
     
     call mkl_set_num_threads(1)    
     call mkl_set_dynamic(0)
     call mkl_set_num_threads(NCPUInner)  
     SELECT CASE (k)
        CASE(0)
 !         call dgemm('N','N',dim,dim,dim,1.d0,A,dim,B,dim,0.d0,C1,dim)
           CALL ZGETRF(ND,ND, A1, ND, IPS1, INFO1 )
       CASE(1)
 !         call dgemm('N','N',dim,dim,dim,1.d0,A,dim,B,dim,0.d0,C2,dim)
           CALL ZGETRF(ND,ND, A2, ND, IPS2, INFO2 )
        END SELECT
   end do
   
   !$OMP END DO
   !$OMP END PARALLEL  
  
      end program NumaAwareDGEMM

It would be nice if anyone could confirm that there is indeed an issue here. If the bug is already known, a workaround or fix would be appreciated.

Cheers

Tue

Gennady_F_Intel · ‎02-02-2016

Nested parallelism may cause thread oversubscription. What ZGETRF performance did you expect, and what did you actually obtain? Please try enabling MKL verbose mode and see what performance you get with and without mkl_dynamic.
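
A minimal sketch of what enabling verbose mode from inside the program could look like, assuming the mkl_verbose service function from the mkl_service module (setting MKL_VERBOSE=1 in the environment should have the same effect). The matrix size and variable names here are illustrative and not taken from the reproducer above.

```fortran
program verbose_sketch
  use mkl_service                      ! assumed to provide mkl_verbose and mkl_set_dynamic
  implicit none
  integer, parameter :: n = 500        ! illustrative size, much smaller than the reproducer
  complex(kind=kind(1d0)), allocatable :: a(:,:)
  integer, allocatable :: ipiv(:)
  integer :: info, old_mode, i, j

  allocate(a(n,n), ipiv(n))
  do j = 1, n
    do i = 1, n
      a(i,j) = 1d0/(1d0 + abs(i-j))    ! same fill pattern as the reproducer
    end do
  end do

  old_mode = mkl_verbose(1)            ! switch verbose mode on; returns the previous mode
  call mkl_set_dynamic(0)              ! rerun with mkl_set_dynamic(1) to compare
  call zgetrf(n, n, a, n, ipiv, info)  ! MKL prints one line describing this call
  old_mode = mkl_verbose(0)            ! switch verbose mode off again
end program verbose_sketch
```

The line printed for each call should include the elapsed time and the number of threads MKL used, which can be compared between the two mkl_dynamic settings.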

Tue_B_ · ‎02-02-2016

Gennady Fedorov (Intel) wrote:

Nested parallelism may cause thread oversubscription. What ZGETRF performance did you expect, and what did you actually obtain? Please try enabling MKL verbose mode and see what performance you get with and without mkl_dynamic.

If I enable mkl_dynamic, MKL routines run serially inside a nested region, so that isn't really an option. But by watching the thread count of a simple code like the one posted above, I see a thread count of over 100 with zgetrf, even though I limit myself to 2 NUMA nodes and 10 mkl_threads on each NUMA node. In my book, at least, this should never happen, and if it is not a bug, I have so far found no information or warning about it.

TimP · ‎02-02-2016

Does zgetrf comply with the current style of setting num_threads for nested parallelism, e.g. omp_num_threads=2,10 under OMP_NESTED? I would have expected the same effect from omp_num_threads=2 mkl_num_threads=10, but the current OpenMP style seems more appealing.

The comments about mkl_dynamic are confusing, since the default setting is TRUE. Setting it to FALSE allows MKL to create a thread for each hyperthread, which evidently is not wanted in nested parallelism unless, possibly, those threads can be confined within the same cores.

Does setting kmp_affinity=verbose let you see how these threads are placed?
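
For reference, a minimal pure-OpenMP sketch of the nested style asked about above, using explicit num_threads clauses in place of a per-level OMP_NUM_THREADS list. The team sizes (2 outer, 5 inner) follow NoNUMANodes and NCPUinner in the reproducer; the program itself is illustrative and says nothing about whether ZGETRF honors these settings internally.

```fortran
program nested_sketch
  use omp_lib
  implicit none
  integer, parameter :: n_outer = 2, n_inner = 5   ! team sizes assumed from the reproducer

  call omp_set_dynamic(.false.)        ! do not let the runtime shrink the teams
  call omp_set_max_active_levels(2)    ! allow two active nested levels

  !$omp parallel num_threads(n_outer)
    !$omp parallel num_threads(n_inner)
      !$omp critical
      print '(a,i0,a,i0,a,i0)', 'level ', omp_get_level(), &
            ': outer thread ', omp_get_ancestor_thread_num(1), &
            ', inner team size ', omp_get_num_threads()
      !$omp end critical
    !$omp end parallel
  !$omp end parallel
end program nested_sketch
```

If the runtime grants both levels, each of the 2 outer threads should report an inner team of 5, for 10 threads in total, which is the count the DGEMM case is reported to reach.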

Tue_B_ · ‎02-03-2016

Tim P. wrote:

Does zgetrf comply with the current style of setting num_threads for nested parallelism, e.g. omp_num_threads=2,10 under OMP_NESTED? I would have expected the same effect from omp_num_threads=2 mkl_num_threads=10, but the current OpenMP style seems more appealing.

It actually seems to make a difference whether omp_num_threads or mkl_num_threads is used, though both spawn way more threads than they should. With an outer parallelization over 2 threads and an inner parallelization over 5 threads, I get 46 threads (+1 monitor thread) with MKL and 50 threads (+1 monitor thread) with OMP. With DGEMM I get the expected 10 threads (+1 monitor thread) whether I use MKL or OMP to set the number of threads.
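
To cross-check counts like these from inside the program, something along the following lines could be used. It is only a sketch, assuming mkl_get_max_threads from the mkl_service module, and it reports what MKL intends to use on each outer thread, not how many OS threads actually exist, so Process Explorer is still needed for the latter.

```fortran
program report_threads
  use omp_lib
  use mkl_service                      ! assumed to provide the mkl_* service routines
  implicit none

  call omp_set_dynamic(.false.)
  call omp_set_nested(.true.)
  call mkl_set_dynamic(0)

  !$omp parallel num_threads(2)
    call mkl_set_num_threads(5)        ! inner MKL thread count, as NCPUinner in the reproducer
    !$omp critical
    print '(a,i0,a,i0,a,i0)', 'outer thread ', omp_get_thread_num(), &
          ' of ', omp_get_num_threads(), &
          ': mkl_get_max_threads() = ', mkl_get_max_threads()
    !$omp end critical
  !$omp end parallel
end program report_threads
```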

Tim P. wrote:

The comments about mkl_dynamic are confusing, since the default setting is TRUE. Setting it to FALSE allows MKL to create a thread for each hyperthread, which evidently is not wanted in nested parallelism unless, possibly, those threads can be confined within the same cores.

I agree that setting mkl_dynamic to false is not the ideal option, but since setting mkl_dynamic=true restricts all MKL routines to 1 thread inside parallel regions, I can't use that. And yes, I have managed to confine the spawned threads to the cores on their NUMA node, so the affinity side of the problem should be all right.

Tim P. wrote:

Does setting kmp_affinity=verbose let you see how these threads are placed?

As far as I can see kmp_affinity=verbose does nothing in this case.