A week ago I noticed that my code is running at about 50% of the expected speed. On closer inspection I found that ZGETRF spawns significantly more threads than intended when called inside a nested parallel region. I can't say for sure whether this is what costs me about half the performance I would expect, but something is definitely off.
Below I have pasted a small example that should illustrate the problem. The code creates an outer parallel region, and inside this region you can call either Intel's DGEMM routine or Intel's ZGETRF routine. When calling DGEMM, the thread count of the program stays at the expected number, whereas when ZGETRF is called the thread count jumps much higher (I monitor the thread count with Process Explorer).
program NumaAwareDGEMM
  use IFPORT
  use omp_lib
  use mkl_service
  implicit none
  integer :: i,j,k,NCPU,NoNumanodes,dim,success,id,NCPUinner,ij,ND,INFO1,INFO2
  integer,allocatable,dimension(:) :: IPS1,IPS2
  real*8,allocatable,dimension(:,:) :: A, B,C1,c2
  complex*16,allocatable,dimension(:,:) :: A1,A2
  real*8,allocatable,dimension(:,:,:) :: C
  real*8,allocatable,dimension(:) :: tmp
  real*8 :: tmp_r

  NCPUinner=5
  NoNUMANodes=2 !How many NUMA nodes to distribute calculations over.
  NCPU=10 !Number of CPUs to run non-nested parallel regions on.
  success = SETENVQQ("OMP_DISPLAY_ENV=TRUE")
  success=SETENVQQ("OMP_PLACES={0:10},{10:10}")
  ND=4500
  dim=ND*NoNUMANodes

  !Multiplication allocatables
  allocate(A(dim,dim))
  allocate(B(dim,dim))
  allocate(C1(dim,dim))
  allocate(C2(dim,dim))

  !Factorization allocatables
  allocate(A1(ND,ND))
  allocate(A2(ND,ND))
  allocate(IPS1(ND))
  allocate(IPS2(ND))

  call KMP_SET_STACKSIZE_S(990000000)
  call omp_set_dynamic(0)
  call mkl_set_dynamic(0)
  call omp_set_nested(1)
  call omp_set_num_threads(NCPU)

  !OpenMP settings are applied when the first parallel loop is found. We do a dummy loop here to get it done now.
  !$OMP PARALLEL DEFAULT(PRIVATE) SHARED(NCpu) REDUCTION(+:J)
  J=0
  !$OMP DO
  do i=1,NCpu
    J=J+1
  end do
  !$OMP END DO
  !$OMP END PARALLEL

  call omp_set_num_threads(1) !There is a bug in Intel's OpenMP implementation that requires the number of threads to be set to 1 before resetting the number of threads
  call omp_set_num_threads(NoNUMANodes)

  !$OMP PARALLEL DEFAULT(SHARED) PRIVATE(i,ID,k,ij,j)
  !$OMP DO SCHEDULE(STATIC)
  do i = 1,NoNumanodes*4
    k=mod(i,2)
    !Make the matrices for Factorization
    if(k.eq.0) then
      do ij=1,ND
        do j=1,ND
          A1(ij,j)=1d0/(1d0+abs(ij-j))
        end do
      end do
    else
      do ij=1,ND
        do j=1,ND
          A2(ij,j)=1d0/(1d0+abs(ij-j))
        end do
      end do
    end if
    call mkl_set_num_threads(1)
    call mkl_set_dynamic(0)
    call mkl_set_num_threads(NCPUInner)
    SELECT CASE (k)
    CASE(0)
      ! call dgemm('N','N',dim,dim,dim,1.d0,A,dim,B,dim,0.d0,C1,dim)
      CALL ZGETRF(ND,ND, A1, ND, IPS1, INFO1 )
    CASE(1)
      ! call dgemm('N','N',dim,dim,dim,1.d0,A,dim,B,dim,0.d0,C2,dim)
      CALL ZGETRF(ND,ND, A2, ND, IPS2, INFO2 )
    END SELECT
  end do
  !$OMP END DO
  !$OMP END PARALLEL
end program NumaAwareDGEMM
It would be nice if someone could confirm that there is indeed an issue here. If the bug is already known, a workaround or fix would be appreciated.
Cheers
Tue
Nested parallelism may lead to thread oversubscription. What performance did you expect from zgetrf, and what did you actually obtain? Please try enabling MKL verbose mode and see what performance you get with and without mkl_dynamic.
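For anyone who wants to try the verbose-mode suggestion, here is a minimal sketch of the comparison, not a definitive test. It assumes that `mkl_verbose` is available through the `mkl_service` module (setting the environment variable `MKL_VERBOSE=1` before launching should be equivalent). The matrix size `n`, the thread count `nthreads`, and the helper `fill_matrix` are hypothetical names chosen for illustration; the matrix entries follow the reproducer in the original post. This version runs at the top level; to study the nested case the same calls would have to go inside the outer parallel region.

```fortran
program mkl_verbose_sketch
  use mkl_service
  implicit none
  integer, parameter :: n = 1000        ! hypothetical size, smaller than the ND=4500 of the reproducer
  integer, parameter :: nthreads = 5    ! hypothetical, matches NCPUinner in the reproducer
  complex*16, allocatable :: a(:,:)
  integer, allocatable :: ipiv(:)
  integer :: info, previous_mode, dynamic_flag

  allocate(a(n,n), ipiv(n))

  ! Switch verbose mode on; the return value is the previous mode (or -1 on failure).
  previous_mode = mkl_verbose(1)
  call mkl_set_num_threads(nthreads)

  ! Factorize once with mkl_dynamic off (0) and once with it on (1).
  do dynamic_flag = 0, 1
    call fill_matrix(a)                 ! zgetrf overwrites a, so refill before each call
    call mkl_set_dynamic(dynamic_flag)
    call zgetrf(n, n, a, n, ipiv, info)
    print '(a,i0,a,i0)', 'mkl_dynamic=', dynamic_flag, '  zgetrf info=', info
  end do

  ! Switch verbose mode off again.
  previous_mode = mkl_verbose(0)

contains

  ! Same test matrix as in the reproducer: 1/(1+|i-j|).
  subroutine fill_matrix(m)
    complex*16, intent(out) :: m(:,:)
    integer :: i, j
    do j = 1, size(m,2)
      do i = 1, size(m,1)
        m(i,j) = 1d0/(1d0+abs(i-j))
      end do
    end do
  end subroutine fill_matrix

end program mkl_verbose_sketch
```

Each MKL call should then print one verbose line, which can be compared between the two `mkl_dynamic` settings for run time and the number of threads MKL reports using.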
Gennady Fedorov (Intel) wrote:
Nested parallelism may lead to thread oversubscription. What performance did you expect from zgetrf, and what did you actually obtain? Please try enabling MKL verbose mode and see what performance you get with and without mkl_dynamic.
If I enable mkl_dynamic, MKL routines run in serial inside a nested region, so that isn't really an option. But by watching the thread count of a simple code like the one posted above, I see a thread count of over 100 with zgetrf, even though I limit myself to 2 NUMA nodes and 10 mkl_threads on each NUMA node. In my book, at least, this should never happen, and if it is not a bug, I have so far found no information or warning about it.
Does zgetrf comply with the current style of setting num_threads for nested parallelism, e.g. omp_num_threads=2,10 under OMP_NESTED? I would have expected the same effect from omp_num_threads=2 mkl_num_threads=10, but the current OpenMP style seems more appealing.
The comments about mkl_dynamic are confusing since the default setting is TRUE. Setting FALSE allows MKL to create a thread for each hyperthread which evidently is not wanted in nested parallelism unless possibly those threads can be confined within the same cores.
Does setting kmp_affinity=verbose let you see how these threads are placed?
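To make the "2,10" style concrete, here is a small sketch using only standard OpenMP, with no MKL involved. It requests 2 threads at the outer level and 10 at the inner level through `num_threads` clauses, which is the in-code counterpart of setting `OMP_NUM_THREADS=2,10`, and prints the team size each outer thread actually gets. The program and variable names are hypothetical, and `omp_set_max_active_levels(2)` is used here in place of `omp_set_nested`, which newer OpenMP versions deprecate.

```fortran
program nested_levels_sketch
  use omp_lib
  implicit none
  integer :: outer_id

  call omp_set_dynamic(.false.)        ! ask the runtime not to shrink the teams
  call omp_set_max_active_levels(2)    ! allow two active levels of parallelism

  ! Outer level: one thread per NUMA node in the original reproducer.
  !$omp parallel num_threads(2) private(outer_id)
  outer_id = omp_get_thread_num()

  ! Inner level: the team that a threaded library call would be expected to use.
  !$omp parallel num_threads(10)
  !$omp single
  print '(a,i0,a,i0,a,i0)', 'outer thread ', outer_id, &
        ': inner team size ', omp_get_num_threads(), &
        ', nesting level ', omp_get_level()
  !$omp end single
  !$omp end parallel

  !$omp end parallel
end program nested_levels_sketch
```

If the runtime honors the request, each outer thread reports its own inner team. Running the same program with the Intel runtime and `KMP_AFFINITY=verbose` set in the environment should additionally print how those threads are bound, which is one way to check the placement question raised above.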
Tim P. wrote:
Does zgetrf comply with the current style of setting num_threads for nested parallelism, e.g. omp_num_threads=2,10 under OMP_NESTED? I would have expected the same effect from omp_num_threads=2 mkl_num_threads=10, but the current OpenMP style seems more appealing.
It actually seems to make a difference whether omp_num_threads or mkl_num_threads is used, though both spawn way more threads than they should. With an outer parallelization over 2 threads and an inner parallelization of 5 threads, I get 46 threads (+1 monitor thread) with MKL and 50 threads (+1 monitor thread) with OMP. With DGEMM I get the expected 10 threads (+1 monitor thread) whether I use MKL or OMP to set the number of threads.
Tim P. wrote:
The comments about mkl_dynamic are confusing since the default setting is TRUE. Setting FALSE allows MKL to create a thread for each hyperthread which evidently is not wanted in nested parallelism unless possibly those threads can be confined within the same cores.
Tim P. wrote:
Does setting kmp_affinity=verbose let you see how these threads are placed?