Intel® oneAPI Math Kernel Library
Ask questions and share information with other developers who use Intel® Math Kernel Library.

Binding threads to processors

Gheibi__Sanaz
Beginner

Hi, 

This question is related to another post on this same forum (https://software.intel.com/en-us/forums/intel-math-kernel-library/topic/549769). We are using code from there with minor modifications, but we are facing a different problem, which is why we have decided to pursue it in a new post.

The following code is supposed to bind thread 0 to procs {0,1,...,31} and thread 1 to procs {32,33,...,63} and deactivate all the other threads; however, it produces 8 threads and binds all of them to all the available processors (procs {0,1,...,255}).

The code and the affinity report follow. We would really appreciate your help.

The Fortran code:

program NumaAwareDGEMM
use IFPORT
use omp_lib
implicit none
include "mkl_service.fi"

logical(4) :: Success
integer :: numaNodeCount,processorPackageCount,processorCoreCount
integer logicalProcessorCount
integer,dimension(3) :: processorCacheCount
integer :: NoNUMANodes, blocksize,nrepeats,Runmode
integer :: N,I,J,NIte, First,Last,k,colidx,error,numofblocks,iii
integer ii,dim,d,threadID,NumaID
integer :: Iter,Solver,NUMASize,m,ThreadsPrNuma
real*8,allocatable,dimension(:,:) :: A, B,C,rA,rB,rC,cA,cB,cC,bA
real*8,allocatable,dimension(:,:) :: bB,bbc
real*8,allocatable,target :: bC(:,:)
logical, allocatable, dimension(:) :: NumaNodeDone

processorPackageCount = 2
logicalProcessorCount = 32
success=SETENVQQ("OMP_PLACES={0:32},{32:32}")

!Create dummy matrices - we just matmul them uninitialized
blocksize=100
NoNUMANodes=2                     !How many NUMA nodes to distribute calculations over
ThreadsPrNuma=4                   !How many threads to use per NUMA node
dim=blocksize*NoNUMANodes
allocate(bA(dim,dim))
allocate(bB(dim,dim))
allocate(bC(dim,dim))
allocate(bbc(blocksize,blocksize))

!we spawn only one thread per package using the spread processor binding and call the mkl for each of these threads.
call KMP_SET_STACKSIZE_S(990000)
call omp_set_dynamic(.false.)
call mkl_set_dynamic(0)
call omp_set_nested(.true.)

!Outer parallel region. This region only does work for the first thread on a processor package.
call omp_set_num_threads(NoNumaNodes)
print *, NoNumaNodes
!$OMP PARALLEL DEFAULT(SHARED) PRIVATE(ii,threadID,NumaID)
      threadID=omp_get_thread_num()
      print *,'Thread binding for socket=',threadID
      ii=mkl_set_num_threads_local(ThreadsPrNuma)
      if(threadID == 0) then
           success = SETENVQQ("OMP_PLACES=0:32")
      else if(threadID == 1) then
           success = SETENVQQ("OMP_PLACES=32:32")
      else
           stop
      end if
      call dgemm('N','N',blocksize,blocksize,blocksize,1.d0,bA,dim, &
                 bB,dim,0.d0,bbc,blocksize)
!$OMP END PARALLEL
end program NumaAwareDGEMM

The affinity report:

OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: {0,1,2,...,255}
OMP: Info #156: KMP_AFFINITY: 256 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #179: KMP_AFFINITY: 1 packages x 64 cores/pkg x 4 threads/core (64 total cores)
OMP: Info #247: KMP_AFFINITY: pid 81056 tid 81056 thread 0 bound to OS proc set {0,1,2,...,255}
OMP: Info #247: KMP_AFFINITY: pid 81056 tid 81058 thread 1 bound to OS proc set {0,1,2,...,255}
OMP: Info #247: KMP_AFFINITY: pid 81056 tid 81059 thread 2 bound to OS proc set {0,1,2,...,255}
OMP: Info #247: KMP_AFFINITY: pid 81056 tid 81061 thread 4 bound to OS proc set {0,1,2,...,255}
OMP: Info #247: KMP_AFFINITY: pid 81056 tid 81060 thread 3 bound to OS proc set {0,1,2,...,255}
OMP: Info #247: KMP_AFFINITY: pid 81056 tid 81062 thread 5 bound to OS proc set {0,1,2,...,255}
OMP: Info #247: KMP_AFFINITY: pid 81056 tid 81063 thread 6 bound to OS proc set {0,1,2,...,255}
OMP: Info #247: KMP_AFFINITY: pid 81056 tid 81064 thread 7 bound to OS proc set {0,1,2,...,255}

Thank you again, 

Sanaz 

Ying_H_Intel
Employee

Hi Sanaz,

What are your compiler and MKL versions? And could you please explain why you want to control these threads in this particular way? In case you are not aware, the latest version provides functions for batched DGEMM, such as https://software.intel.com/en-us/mkl-developer-reference-fortran-gemm-batch, which take care of the multi-threading issues for you. So if possible, I would recommend trying the batched GEMM.

I haven't tried the code, but I can explain why you get 8 threads: you have the outer OpenMP threads,

call omp_set_num_threads(NoNumaNodes)            ! NoNumaNodes = 2

and the MKL internal threads,

ii = mkl_set_num_threads_local(ThreadsPrNuma)    ! ThreadsPrNuma = 4

so the total is 2 x 4 = 8.
As for the affinity: does the original code from that forum thread work on your machine?

Then consider modifying your code, e.g. remove line 22, move line 46 before line 54, etc.

And please also submit the issue at https://supporttickets.intel.com/?lang=en-US, so that you can pass along your private information.

Best Regards,

Ying

Gheibi__Sanaz
Beginner

Hi Ying, 

Thank you very much for your help. We are using Intel Fortran (ifort) as our compiler. I will check the versions as soon as I can and post them here. We are doing all this because we want to do a large-scale matrix multiplication on KNL. We are essentially breaking the matrix into submatrices and performing multiplications on each of those submatrices using parallel MKL DGEMM calls. We want the threads used by each DGEMM to be located near each other, so that they can efficiently use shared resources. On the other hand, we want the thread pools corresponding to the different DGEMM calls to be as far from each other as possible, because that gives more efficient utilization of cache and memory bandwidth. That is why we try to place threads 0 and 1 on different proc sets and deactivate the other threads; we want the other threads to be used later by the MKL calls.
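To make that concrete, the structure we are aiming for looks roughly like the sketch below. This is just an illustration with placeholder sizes, and the PROC_BIND(SPREAD) clause is only one possible way of expressing "outer threads far apart, MKL threads close to their parent"; it also assumes a sensible OMP_PLACES setting (e.g. cores) is exported before the run.

program spread_dgemm_sketch
use omp_lib
implicit none
include "mkl_service.fi"
integer, parameter :: n = 2000           ! placeholder size (must be even here)
integer :: tid, col0, ii
real*8, allocatable :: a(:,:), b(:,:), c(:,:)

allocate(a(n,n), b(n,n), c(n,n))
a = 1.d0
b = 1.d0
c = 0.d0

call omp_set_nested(.true.)              ! let the MKL threads nest inside the outer region
call mkl_set_dynamic(0)

! Two outer threads, spread as far apart as the place list allows;
! each one runs a 4-thread MKL DGEMM on its own half of C.
!$OMP PARALLEL NUM_THREADS(2) PROC_BIND(SPREAD) PRIVATE(tid,col0,ii)
      tid  = omp_get_thread_num()
      col0 = tid*(n/2) + 1               ! first column of this thread's block of B and C
      ii   = mkl_set_num_threads_local(4)
      call dgemm('N','N',n,n/2,n,1.d0,a,n, &
                 b(1,col0),n,0.d0,c(1,col0),n)
!$OMP END PARALLEL

print *, c(1,1), c(1,n)                  ! both should equal n = 2000.0
end program spread_dgemm_sketch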

Your explanation about the total number of threads makes total sense. Thank you very much for that. 

Also, thank you very much for your suggestions regarding affinity. We will try them as soon as we can, and if there are further problems, we will post here again.

 

Best, 

Sanaz 

Gheibi__Sanaz
Beginner

Hi again Ying, 

Thank you again for your valuable help. We did as you suggested (removed line 22 and placed line 46 before line 54), but the result didn't change at all. Is there anything else we could do? Or is there any debugging tool, etc., that could help us identify the exact source of the problem?

Also, about the batched DGEMM: it doesn't fit our purpose, since we are doing the opposite thing. That is, instead of grouping small matrices together, we are trying to break a large one apart and operate on its blocks in parallel.

Best Regards, 

Sanaz 

 

TimP
Honored Contributor III

If that is your intent, it seems you should use the parallelized GEMM or opt_matmul provided with ifort. By the way, in such a context the legacy, non-standard real*8 is not necessarily the same as real(real64) or real(selected_real_kind(12)), so it seems you are taking unnecessary chances.
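For instance, a small standard-conforming illustration of what I mean:

program kind_sketch
use, intrinsic :: iso_fortran_env, only: real64
implicit none
integer, parameter :: dp = selected_real_kind(12)   ! kind with at least 12 decimal digits
real(real64) :: x                                   ! 64-bit storage kind from iso_fortran_env
real(dp)     :: y                                   ! kind chosen by required precision
x = 1.0_real64
y = 1.0_dp
print *, kind(x), kind(y), precision(x), precision(y)
end program kind_sketch

With ifort these normally all map to the same 8-byte kind, but spelling the kind out keeps the code portable and unambiguous.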

Ying_H_Intel
Employee

Hi Sanaz, Tim
Thanks for your replies. Right, most people who want CPU affinity want it for the same reason you do:

we are doing all this because we want to do a large-scale matrix multiplication on KNL. We are essentially breaking the matrix into submatrices and performing multiplications on each of those submatrices using parallel MKL DGEMM calls. We want the threads used by each DGEMM to be located near each other, so that they can efficiently use shared resources. On the other hand, we want the thread pools corresponding to the different DGEMM calls to be as far from each other as possible, because that gives more efficient utilization of cache and memory bandwidth.

And MKL DGEMM already takes these factors into account for best performance. That is why we recommend using opt_matmul provided with ifort, or DGEMM directly, instead of controlling the threads yourself, if your case allows it.
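For example, the opt_matmul route can be as simple as the sketch below. This is only an illustration: please check the exact option name for your ifort version, I am quoting -qopt-matmul from memory.

program matmul_sketch
implicit none
integer, parameter :: n = 1000
real*8, allocatable :: a(:,:), b(:,:), c(:,:)

allocate(a(n,n), b(n,n), c(n,n))
a = 1.d0
b = 1.d0

! With something like:  ifort -O3 -qopt-matmul matmul_sketch.f90
! the compiler may turn this intrinsic into an optimized (library) matmul call,
! and the threading is then handled for you.
c = matmul(a, b)
print *, c(1,1)                          ! expect 1000.0
end program matmul_sketch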

For example, could you tell us the exact sizes in your large-scale matrix multiplication? Assume you have L products A(m x k) * B(k x n): do you mean that L is large, or that m, k, n are large? If L is large, you may consider the batched GEMM directly; if m, k, or n is large, then DGEMM should be fine.

Best Regards,

Ying

 

Gheibi__Sanaz
Beginner

Thank you very much Tim and Ying, 

Actually, in our case m, n, and k are large, but L is not. For example, to multiply two 32K x 32K matrices we perform L = 64 block multiplications on 8K x 8K blocks (each matrix splits into a 4 x 4 grid of blocks, and each of the 16 output blocks accumulates 4 block products, giving 4 x 4 x 4 = 64 multiplications). It is possible to use blocks smaller than 8K x 8K, but since we have a lot of cores available on KNL for each DGEMM, we would prefer a use case with a larger block size and fewer matrices.

Thank you again, 

Sanaz 

Ying_H_Intel
Employee

Hi Sanaz,

Nice to know the matrices are big enough :). You may refer to the MKL benchmarks, https://software.intel.com/en-us/mkl/features/benchmarks. Using DGEMM should be fine for utilizing the KNL core resources efficiently.
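Just to illustrate the simplest form of that, here is a sketch only; the block size and thread count are example values for one 8K x 8K block on a 64-core KNL.

program whole_node_dgemm
implicit none
include "mkl_service.fi"
integer, parameter :: n = 8192           ! one 8K x 8K block, as in your description
real*8, allocatable :: a(:,:), b(:,:), c(:,:)

allocate(a(n,n), b(n,n), c(n,n))
a = 1.d0
b = 1.d0
c = 0.d0

call mkl_set_num_threads(64)             ! example: one MKL thread per KNL core
call dgemm('N','N',n,n,n,1.d0,a,n,b,n,0.d0,c,n)
print *, c(1,1)                          ! expect n = 8192.0
end program whole_node_dgemm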

Best Regards,

Ying

Gheibi__Sanaz
Beginner

Thank you very much, Ying. I will try it, and hopefully it will work.

Thank you again, 

Sanaz 
