Intel® oneAPI Math Kernel Library
Ask questions and share information with other developers who use Intel® Math Kernel Library.

How to set affinity of threads spawned by MKL?

Tue_B_
Novice

I'm trying to do some optimization on a NUMA system at the moment and have run into trouble with regards to thread affinity.

The essence of the code I'm using can be seen below:

      success = SETENVQQ("OMP_PLACES={0,1,2,3,4,5,6,7},{8,9,10,11,12,13,14,15},{16,17,18,19,20,21,22,23},{24,25,26,27,28,29,30,31},{32,33,34,35,36,37,38,39},{40,41,42,43,44,45,46,47},{48,49,50,51,52,53,54,55},{56,57,58,59,60,61,62,63}")
      NUMA=4
      ThreadsPrNuma=2
      call omp_set_dynamic(0)
      call mkl_set_dynamic(0)
      call omp_set_nested(1)
      call omp_set_num_threads(NUMA)
      call mkl_set_num_threads(NUMA)
      !$OMP PARALLEL DEFAULT(SHARED) PRIVATE(k) proc_bind(spread)    
      !$OMP DO SCHEDULE(STATIC)
      do k=1,NUMA
        !$OMP PARALLEL DEFAULT(SHARED) PRIVATE(i) proc_bind(master)
        !$OMP END PARALLEL          
        call omp_set_num_threads(ThreadsPrNuma)
        call mkl_set_num_threads(ThreadsPrNuma)
        call MKL_routine_to_run_parallel_on_each_NUMA_node()
      end do
      !$OMP END DO
      !$OMP END PARALLEL

So the idea is that I want to do an outer parallelization over each NUMA node - proc_bind(spread).

And then I want each of these threads to have an inner parallelization over the processors in that NUMA node. The problem is that the MKL routine doesn't seem to respect the proc_bind(master) I set.
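As a side note, a minimal way to verify where the outer threads actually land is to print the OpenMP place each thread is bound to (sketch below, assuming an OpenMP 4.5 runtime that provides omp_get_place_num; the code above does not rely on it):

    program check_places
      use omp_lib
      implicit none
      ! Run with e.g. OMP_PLACES="{0:8},{8:8},..." and OMP_PROC_BIND=spread
      ! to see which place each outer thread is bound to.
      !$OMP PARALLEL DEFAULT(SHARED) proc_bind(spread)
      print *, 'thread', omp_get_thread_num(), &
               'bound to place', omp_get_place_num(), &
               'of', omp_get_num_places()
      !$OMP END PARALLEL
    end program check_places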

Anyone got any idea how to do such a thing?

Cheers

Tue

Tue_B_
Novice

Hey Jim, what you are suggesting is exactly what I have been doing for the last couple of hours, and here is the result.

It works!

For some unknown reason an MKL pool spawned on the master thread does not respect the affinity setting, but the pools spawned on all the other threads do.

Here is a code example of something running with correct affinity even after the initialization.

Once I get a complete numa-aware dgemm matrix multiplication up and running I will post it along with some performance results.

Cheers!

 

   program NumaAwareDGEMM
   use IFPORT
   use omp_lib
   use mkl_service
   implicit none

   logical(4) :: Success
   integer :: numaNodeCount,processorPackageCount,processorCoreCount,logicalProcessorCount
   integer,dimension(3) :: processorCacheCount   
   integer :: NoNUMANodes, blocksize,nrepeats,Runmode,t0
   integer :: N,I,J,NIte, First,Last,k,colidx,error,numofblocks,iii,ii,dim,d,threadID,NumaID
   integer :: Iter,Solver,NUMASize,m,ThreadsPrNuma
   integer, allocatable,dimension(:) :: GlobalThreadID
   real*8,allocatable,dimension(:,:) :: A, B,C,rA,rB,rC,cA,cB,cC,bA,bB,bbc
   real*8,allocatable,target :: bC(:,:)
   logical, allocatable, dimension(:) :: NumaNodeDone,MKlbusy
      
   !Set the mode to test
   call GetMachineArch(numaNodeCount,processorPackageCount,d,logicalProcessorCount,processorCacheCount ) 
   NoNUMANodes=9                     !How many NUMA nodes to distribute calculations over
   SELECT CASE (NoNUMANodes)
   CASE(1)
   success=SETENVQQ("OMP_PLACES={0:8}")
   CASE(2)
   success=SETENVQQ("OMP_PLACES={0:8},{8:8}")
   CASE(3)
   success=SETENVQQ("OMP_PLACES={0:8},{8:8},{16:8}")
   CASE(4)
   success=SETENVQQ("OMP_PLACES={0:8},{8:8},{16:8},{24:8}")
   CASE(5)
   success=SETENVQQ("OMP_PLACES={0:8},{8:8},{16:8},{24:8},{32:8}")
   CASE(6)
   success=SETENVQQ("OMP_PLACES={0:8},{8:8},{16:8},{24:8},{32:8},{40:8}")
   CASE(7)
   success=SETENVQQ("OMP_PLACES={0:8},{8:8},{16:8},{24:8},{32:8},{40:8},{48:8}")
   CASE(8)
   success=SETENVQQ("OMP_PLACES={0:8},{8:8},{16:8},{24:8},{32:8},{40:8},{48:8},{56:8}")
   CASE(9)
   success=SETENVQQ("OMP_PLACES={0:8},{8:8},{16:8},{24:8},{32:8},{40:8},{48:8},{56:8}")
   END SELECT       
   
   !Create dummy matrices - we just matmul them uninitialized  
   blocksize=6000
   ThreadsPrNuma=4                   !How many threads to use pr. numa node 
   dim=blocksize*NoNUMANodes
   allocate(bA(dim,dim))
   allocate(bB(dim,dim))
   allocate(bC(dim,dim))
   allocate(bbc(blocksize,blocksize))
   allocate(GlobalThreadID(NoNumanodes*ThreadsPrNuma))
   allocate(Mklbusy(NoNumanodes))
   call KMP_SET_STACKSIZE_S(990000000)
   call omp_set_dynamic(0)
   call mkl_set_dynamic(0)
   call omp_set_nested(1)

     !initialization region
     call omp_set_num_threads(NoNumaNodes)
     !$OMP PARALLEL DEFAULT(SHARED) PRIVATE(i,ii,threadID,NumaID)
     !$OMP DO SCHEDULE(STATIC)
     do i = 1,NoNumanodes
        GlobalThreadID(i)=omp_get_thread_num()
        print *,'Thread binding for socket=',GlobalThreadID(i)
        SELECT CASE (GlobalThreadID(i))
        CASE(1)
        success=SETENVQQ("OMP_PLACES={0:8}")
        CASE(2)
        success=SETENVQQ("OMP_PLACES={8:8}")
        CASE(3)
        success=SETENVQQ("OMP_PLACES={16:8}")
        CASE(4)
        success=SETENVQQ("OMP_PLACES={24:8}")
        CASE(5)
        success=SETENVQQ("OMP_PLACES={32:8}")
        CASE(6)
        success=SETENVQQ("OMP_PLACES={40:8}")
        CASE(7)
        success=SETENVQQ("OMP_PLACES={48:8}")
        CASE(8)
        success=SETENVQQ("OMP_PLACES={56:8}")
        END SELECT 
        if(GlobalThreadID(i).gt.0) then
            call KMP_SET_BLOCKTIME(0)
            ii=mkl_set_num_threads_local(ThreadsPrNuma)
            call dgemm('N','N',blocksize,blocksize,blocksize,1.d0,bA,dim,bB,dim,0.d0,bbc,blocksize)
        end if
     end do
     !$OMP END DO
     !$OMP END PARALLEL  
     
     MKLbusy = .true. ! 
     !$OMP PARALLEL DEFAULT(SHARED) PRIVATE(i,ii,threadID,NumaID) ! all threads scheduled 
     !$OMP DO SCHEDULE(STATIC)
     do i = 1,NoNumanodes
        threadID=omp_get_thread_num()
        print *,'Thread binding for socket=',GlobalThreadID(i),threadID
        if((threadID.gt.0)) then
            ! calculate tile for thread 0
            call dgemm('N','N',blocksize,blocksize,blocksize,1.d0,bA,dim,bB,dim,0.d0,bbc,blocksize)
        end if 
     end do
     !$OMP END DO
     !$OMP END PARALLEL  
    end program NumaAwareDGEMM

 

jimdempseyatthecove
Honored Contributor III

Tue,

You (we) managed to do something that a lot of other people have attempted but failed to do, for lack of documentation (or of understanding of the internals of the relationship between MKL and OpenMP in the host program).

Steve L., if you've been following this thread, see if you can get Tue B.'s sample program above into the MKL User and Reference manual, with a good explanation of why this works and the nuances of using it.

Steve, Tue

Your example code above still needs a little more work.

Your line 93 !$OMP DO SCHEDULE(STATIC) will look at the following "do i=1,NoNumaNodes" and create a team of NoNumaNodes threads (0, 1, ... NoNumaNodes-1). This includes the process main thread as team member number 0, which you exclude with your "if(threadID.gt.0)", but this also excludes the parallel region's team member number NoNumaNodes, which is the one now processing NUMA node 0's slice of the work.

  MKLbusy = .true. ! 
  !$OMP PARALLEL DEFAULT(SHARED) PRIVATE(i,ii,threadID,NumaID) ! all threads scheduled 
  !$OMP DO SCHEDULE(STATIC)
  do i = 1,NoNumanodes+1
     threadID=omp_get_thread_num()
     if((threadID.gt.0)) then
         ! calculate tile for Numa Node threadID-1
         call dgemm('N','N',blocksize,blocksize,blocksize,1.d0,bA,dim,bB,dim,0.d0,bbc,blocksize)
     end if 
  end do
  !$OMP END DO
  !$OMP END PARALLEL  

I strongly suggest for the example code to go into the MKL reference guide, that the above sample is reworked to actually partition the working array. A version along the line of:

module Partition_module
type Partition_t
  ! sufficient number of variables to specify
  ! the entire workspace (outside of parallel region)
  ! as well as a subsection for use by thread inside parallel region
  ...
end type Partition_t
...
end module Partition_module

  use Partition_module
  type(Partition_t) :: masterPartition
  type(Partition_t), allocatable :: NumaNodePartition(:) ! allocated 1:NoNumaNodes
  ! initialize MKL pools
  ...
  ! Initialize masterPartition
  ...
  allocate(NumaNodePartition(NoNumaNodes)) ! 1:NoNumaNodes
  ! Using masterPartition and NoNumaNodes, initialize each node partition
  ...
  ! use code
  MKLbusy = .true. ! 
  !$OMP PARALLEL DEFAULT(SHARED) PRIVATE(i,ii,threadID,NumaID) ! all threads scheduled 
  threadID=omp_get_thread_num()
  if((threadID.gt.0).and.(threadID.le.NoNumaNodes)) then
     call YourShell_dgemm(NumaNodePartition(threadID))
     MKLbusy(threadID) = .false.
  else
     do while(any(MKLbusy))
        call SLEEPQQ(0)
     end do
  endif
  !$OMP END PARALLEL  

...
subroutine YourShell_dgemm(Partition)
  use Partition_module
  type(Partition_t) :: Partition
  ...
  call dgemm('N','N', &
             & Partition%blocksize, &
             & Partition%blocksize, &
             & Partition%blocksize, &
             & 1.d0, &
             & Partition%bA, &
             & Partition%dim, &
             & Partition%bB, &
             & Partition%dim, &
             & 0.d0, &
             & Partition%bbc, &
             & Partition%blocksize)
end subroutine YourShell_dgemm 

Jim Dempsey

 

Steven_L_Intel1
Employee

I am going to move this to the MKL forum so the MKL experts there can look at it.

Steven_L_Intel1
Employee

I did send a request to the MKL documentation writers to look at this thread to see what they could use from it.

Tue_B_
Novice

jimdempseyatthecove wrote:

Your line 93 !$OMP DO SCHEDULE(STATIC) will look at the following "do i=1,NoNumaNodes" and create a team of NoNumaNodes threads (0, 1, ... NoNumaNodes-1). This includes the process main thread as team member number 0, which you exclude with your "if(threadID.gt.0)", but this also excludes the parallel region's team member number NoNumaNodes, which is the one now processing NUMA node 0's slice of the work.

 

I'm not entirely sure I understand you correctly here, but what I think you are saying is the following:

If I do 

  do i = 1,NoNumanodes+1
        threadID=omp_get_thread_num()
        if((threadID.gt.0)) then
            call dgemm('N','N',blocksize,blocksize,blocksize,1.d0,bA,dim,bB,dim,0.d0,bbc,blocksize)
        end if 
     end do

then you are afraid that if((threadID.gt.0)) will not only exclude the master thread but also the last thread, which is meant to do the work instead of the master thread. Is that correct?

Because if it is, that is not the case: the last thread gets a threadID equal to NoNumanodes, not 0.

That said, there were a few things that were not too pretty in the last bit of code I posted, so here is a cleaner version of it. In the previous one I simply increased NoNumanodes to 9 when I wanted to run 8 NUMA nodes; in this one that is no longer the case.

I have added an image of the CPU load to show that it does indeed distribute the load correctly. (The reason node 7 is a bit higher is that a process was already running in the background.)

  program NumaAwareDGEMM
   use IFPORT
   use omp_lib
   use mkl_service
   implicit none

   logical(4) :: Success
   integer :: numaNodeCount,processorPackageCount,processorCoreCount,logicalProcessorCount
   integer,dimension(3) :: processorCacheCount   
   integer :: NoNUMANodes, blocksize,nrepeats,Runmode,t0
   integer :: N,I,J,NIte, First,Last,k,colidx,error,numofblocks,iii,ii,dim,d,threadID,NumaID
   integer :: Iter,Solver,NUMASize,m,ThreadsPrNuma
   integer, allocatable,dimension(:) :: GlobalThreadID
   real*8,allocatable,dimension(:,:) :: A, B,C,rA,rB,rC,cA,cB,cC,bA,bB,bbc
   real*8,allocatable,target :: bC(:,:)
   logical, allocatable, dimension(:) :: NumaNodeDone,MKlbusy
      
   call GetMachineArch(numaNodeCount,processorPackageCount,d,logicalProcessorCount,processorCacheCount ) 
   
   NoNUMANodes=8                     !How many NUMA nodes to distribute calculations over
   ThreadsPrNuma=4                   !How many threads to use pr. numa node 
   
   SELECT CASE (NoNUMANodes) !In this case we have 8 logical processors on each NUMA node.
   CASE(1)
   success=SETENVQQ("OMP_PLACES={0:8}") 
   CASE(2)
   success=SETENVQQ("OMP_PLACES={0:8},{8:8}")
   CASE(3)
   success=SETENVQQ("OMP_PLACES={0:8},{8:8},{16:8}")
   CASE(4)
   success=SETENVQQ("OMP_PLACES={0:8},{8:8},{16:8},{24:8}")
   CASE(5)
   success=SETENVQQ("OMP_PLACES={0:8},{8:8},{16:8},{24:8},{32:8}")
   CASE(6)
   success=SETENVQQ("OMP_PLACES={0:8},{8:8},{16:8},{24:8},{32:8},{40:8}")
   CASE(7)
   success=SETENVQQ("OMP_PLACES={0:8},{8:8},{16:8},{24:8},{32:8},{40:8},{48:8}")
   CASE(8)
   success=SETENVQQ("OMP_PLACES={0:8},{8:8},{16:8},{24:8},{32:8},{40:8},{48:8},{56:8}")
   END SELECT       
   
   blocksize=6000
   dim=blocksize*NoNUMANodes
   allocate(bA(dim,dim))
   allocate(bB(dim,dim))
   allocate(bC(dim,dim))
   allocate(bbc(blocksize,blocksize))
   allocate(GlobalThreadID(NoNumanodes+1))
   call KMP_SET_STACKSIZE_S(990000000)
   call omp_set_dynamic(0)
   call mkl_set_dynamic(0)
   call omp_set_nested(1)

     !initialization region
     call omp_set_num_threads(NoNumaNodes+1)
     !$OMP PARALLEL DEFAULT(SHARED) PRIVATE(i,ii)
     !$OMP DO SCHEDULE(STATIC)
     do i = 1,NoNumanodes+1
        GlobalThreadID(i)=omp_get_thread_num()
        print *,'Thread binding for socket=',GlobalThreadID(i)
        SELECT CASE (GlobalThreadID(i))
        CASE(1)
        success=SETENVQQ("OMP_PLACES={0:8}")
        CASE(2)
        success=SETENVQQ("OMP_PLACES={8:8}")
        CASE(3)
        success=SETENVQQ("OMP_PLACES={16:8}")
        CASE(4)
        success=SETENVQQ("OMP_PLACES={24:8}")
        CASE(5)
        success=SETENVQQ("OMP_PLACES={32:8}")
        CASE(6)
        success=SETENVQQ("OMP_PLACES={40:8}")
        CASE(7)
        success=SETENVQQ("OMP_PLACES={48:8}")
        CASE(8)
        success=SETENVQQ("OMP_PLACES={56:8}")
        END SELECT 
        if(GlobalThreadID(i).gt.0) then
            call KMP_SET_BLOCKTIME(0)
            ii=mkl_set_num_threads_local(ThreadsPrNuma)
            call dgemm('N','N',blocksize,blocksize,blocksize,1.d0,bA,dim,bB,dim,0.d0,bbc,blocksize)
        end if
     end do
     !$OMP END DO
     !$OMP END PARALLEL  
     
     !$OMP PARALLEL DEFAULT(SHARED) PRIVATE(i,ii,threadID)  
     !$OMP DO SCHEDULE(STATIC)
     do i = 1,NoNumanodes+1
        threadID=omp_get_thread_num()
        print *,'Thread binding for socket=',GlobalThreadID(i),threadID
        if((threadID.gt.0)) then
            call dgemm('N','N',blocksize,blocksize,blocksize,1.d0,bA,dim,bB,dim,0.d0,bbc,blocksize)
        end if 
     end do
     !$OMP END DO
     !$OMP END PARALLEL  
    end program NumaAwareDGEMM

 

 

 

jimdempseyatthecove
Honored Contributor III

Good screenshot - the proof of the pudding. Now for an eye opener:

time0 = omp_get_wtime()
do j=1,10   
   !$OMP PARALLEL DEFAULT(SHARED) PRIVATE(i,ii,threadID)  
   !$OMP DO SCHEDULE(STATIC)
   do i = 1,NoNumanodes+1
      threadID=omp_get_thread_num()
      print *,'Thread binding for socket=',GlobalThreadID(i),threadID
      if((threadID.gt.0)) then
          call dgemm('N','N',blocksize,blocksize,blocksize,1.d0,bA,dim,bB,dim,0.d0,bbc,blocksize)
      else
        call SLEEPQQ(0) ! make one run with this, a second run with this commented out
      end if 
   end do
   !$OMP END DO
   !$OMP END PARALLEL  
end do
print *, "time=", omp_get_wtime() - time0

Then experiment with the MKLbusy flags

time0 = omp_get_wtime()
MKLbusy = .true.
do j=1,10   
   !$OMP PARALLEL DEFAULT(SHARED) PRIVATE(i,ii,threadID)  
   !$OMP DO SCHEDULE(STATIC)
   do i = 1,NoNumanodes+1
      threadID=omp_get_thread_num()
      print *,'Thread binding for socket=',GlobalThreadID(i),threadID
      if((threadID.gt.0)) then
          call dgemm('N','N',blocksize,blocksize,blocksize,1.d0,bA,dim,bB,dim,0.d0,bbc,blocksize)
          MKLbusy(threadID) = .false.
      else
        ! threadID 0 just waits here for the MKL threads to finish
        do while(any(MKLbusy))
          call SLEEPQQ(0) ! make one run with this, a second run with this commented out
        end do
      end if 
   end do
   !$OMP END DO
   !$OMP END PARALLEL  
end do
print *, "time=", omp_get_wtime() - time0

The MKLbusy code above relieves the spin-wait time of the master thread on NUMA node 0 (default ~200 ms for that thread). With the do j=1,10 loop, you could see as much as 2 seconds of difference in performance.
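An alternative knob for the same spin-wait (assuming the Intel OpenMP runtime, which the code above already uses for KMP_SET_BLOCKTIME) is to set the block time to zero so an idle thread goes to sleep immediately instead of spinning:

   call KMP_SET_BLOCKTIME(0)   ! affects the calling thread and the threads it spawns afterwards
   ! or, from the environment: KMP_BLOCKTIME=0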

In a formal program you may want to use NoNumaNodes threads, or more, in parallel regions that are not using MKL. In this case you would want to use

if((threadID.gt.0).and.(threadID.le.NoNumaNodes)) then

as the MKL filter.
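For example, in sketch form (reusing variable names from the programs above; the non-MKL branch is just a placeholder):

   ! Sketch only: every thread does ordinary work, but only threads
   ! 1..NoNumaNodes are allowed to start an MKL call.
   call omp_set_num_threads(NoNumaNodes+1)
   !$OMP PARALLEL DEFAULT(SHARED) PRIVATE(threadID)
   threadID = omp_get_thread_num()
   if((threadID.gt.0).and.(threadID.le.NoNumaNodes)) then
      ! this thread owns one NUMA node's MKL pool
      call dgemm('N','N',blocksize,blocksize,blocksize,1.d0,bA,dim,bB,dim,0.d0,bbc,blocksize)
   else
      ! non-MKL work for the remaining threads goes here
   end if
   !$OMP END PARALLEL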

Jim Dempsey

Tue_B_
Novice

I tried playing around with the spintime, but didn't get much performance out of that.

I have also tested the NUMA-aware dgemm against the normal one, and what I find is the following:

On a matrix with 16000 rows/columns,

with a blocksize of 1000,

and 8 NUMA nodes,

with 3 threads per NUMA node,

I get around 200 GFlops.

This was the maximum speed I was able to get; even though each NUMA node has 8 logical processors, increasing the threads per NUMA node above 3 gave slower speeds.

 

However, with the same matrix solved in one block, distributed across the memory, I was able to get 340 GFlops, and it was able to utilize 7 threads per NUMA node.

So it seems to me that the dgemm routine has some kind of NUMA awareness built into it already; otherwise I really don't see how such speeds are possible.

Next up I will do tests with NUMA-aware pardiso.

Steven_L_Intel1
Employee

I got a rather interesting response from the MKL development team:

Control over thread affinity is customer responsibility and MKL does not change the affinity setting in any way. Another observation is that whole exercise of building NUMA-aware BLAS from MKL is pointless as MKL is aware of NUMA architecture and data locality. 

Tue_B_
Novice

Steve Lionel (Intel) wrote:

I got a rather interesting response from the MKL development team:

Control over thread affinity is customer responsibility and MKL does not change the affinity setting in any way. Another observation is that whole exercise of building NUMA-aware BLAS from MKL is pointless as MKL is aware of NUMA architecture and data locality. 

Now that is very interesting! The last I heard was that MKL was NOT NUMA-aware, and when I search around on the Intel MKL pages or through Google I am still not able to find anything that says MKL is NUMA-aware. The closest I can find is this article:

https://software.intel.com/en-us/articles/getting-high-performance-on-numa-based-nehalem-ex-system-with-mkl-without-controlling-numa 

which goes into the performance issue of MKL with NUMA, but also says the following:

Windows doesn't happen to provide the same sort of utility, so the performance on Windows is poor.

But that is great news if MKL is NUMA-aware! Do you know whether they have any kind of documentation about how one should approach and use MKL on NUMA systems?

On that same note... I have moved on to trying to parallelize pardiso, but so far with no success. I feel like I am encountering a different issue here, since I simply can't get pardiso to use any more than 1 thread per pardiso call:

     !$OMP PARALLEL DEFAULT(SHARED) PRIVATE(i,ii)  
      !$OMP DO SCHEDULE(STATIC)
       do i = 1,2
         if (i.eq.1) then 
          ii=mkl_set_num_threads_local(3)
          CALL pardiso (PT, maxfct, mnum, mtype, phase, sH%NoRows, sH%Vals, sH%RowIdxs, sH%ColIdxs, idum, nrhs, iparm, msglvl, myov, x1, error)
         else if (i.eq.2) then
           ii=mkl_set_num_threads_local(3)
           CALL pardiso (PT, maxfct, mnum, mtype, phase, sH%NoRows, sH%Vals, sH%RowIdxs, sH%ColIdxs, idum, nrhs, iparm, msglvl, myov, x2, error)
         end if
        end do
     !$OMP END DO    
     !$OMP END PARALLEL

So the idea is that the outer loop will split across the NUMA nodes, and mkl_set_num_threads_local(3) will then make pardiso spawn 3 threads on each NUMA node. The problem is that right now I get 2 instances of pardiso, but each instance runs on only 1 thread, completely disregarding the mkl_set_num_threads_local(3)...

jimdempseyatthecove
Honored Contributor III

On a single large matrix multiply, letting MKL figure things out is generally best. That said, if your program has multiple batches (frames), then you might find it best to pipeline these to the different nodes.

Also note that the main benefit of NUMA-aware programming comes from ensuring that the memory used resides on the NUMA node doing the processing. Although temporary arrays (internal to MKL) may get allocated on the local node, the input and output arrays are generally allocated by the main thread and, after allocation, are "first touch" mapped to the node that first touches them. This means your first touch has to be performed on the nodes you want to use in later calls.
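As an illustration, a minimal first-touch sketch (the array name A is a placeholder here; the point is to use the same STATIC schedule for this initialization as for the later compute loop, so each node first-touches the data it will work on):

   ! Allocation itself does not commit pages; the first write does, so let
   ! each bound thread initialize the columns it will later work on.
   allocate(A(dim,dim))
   !$OMP PARALLEL DO SCHEDULE(STATIC) DEFAULT(SHARED) PRIVATE(j)
   do j = 1, dim
      A(:,j) = 0.d0   ! column j is first touched by the thread that owns it
   end do
   !$OMP END PARALLEL DO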

Jim Dempsey

Alexander_K_Intel2

Tue B. wrote:

[quoted the post above about parallelizing pardiso, where each pardiso call runs on only 1 thread despite mkl_set_num_threads_local(3)]

Hi,

Nice to see mention of MKL pardiso here! Can I ask you to set the msglvl parameter to 1 and send the output to me? That would verify whether pardiso works on 1 thread only.

Thanks,

Alex

 

Tue_B_
Novice

Steve Lionel (Intel) wrote:

I got a rather interesting response from the MKL development team:

Control over thread affinity is customer responsibility and MKL does not change the affinity setting in any way. Another observation is that whole exercise of building NUMA-aware BLAS from MKL is pointless as MKL is aware of NUMA architecture and data locality. 

I showed my boss the above, and he told me that cannot be. So together with him I made an example using the sparse matrix product, mkl_dcsrmultcsr, which shows very bad NUMA scaling.

Given the sparsity of the matrix we are running this on, we would expect an almost linear speedup from throwing more NUMA nodes at this problem, which is clearly not the case.

 Matrix loading time    105.785191362025
 Runtime 1 thread, no dist,GFLOPS:   171.182012695586   5.374396539695677E-003
 Runtime 7 thread, no dist,GFLOPS:   26.0330252225976   3.533972747395896E-002
 Runtime 56 thread, dist,GFLOPS:   9.05253138463013       0.101629033648127

So utilizing 8 NUMA nodes with 7 threads each, instead of just one NUMA node with 7 threads, gives us a speed-up of less than 3.

I have attached the matrix we used, and the program that does the calculation is the following, which should run standalone:

    program SparseMatrixProduct
    use IFPORT
    use omp_lib
  implicit none

 
    real*8,allocatable          :: myVals(:),myVals2(:),myVals3(:)
    integer,allocatable         :: MyRows(:),MyCols(:),MyRows2(:),MyCols2(:),MyRows3(:),MyCols3(:)

    integer                         :: info,CNoElements,NoElements,NoRows,NoCols
    real*8                          :: RuntimeBegin,RunTimeEnd,Flops
    logical :: Success
    integer :: nrepeats
    integer :: N,I,J,NIte, First,Last
    
    integer,parameter               :: F=99
    character*256                   :: S
    success=SETENVQQ("OMP_PLACES={0:8},{8:8},{16:8},{24:8},{32:8},{40:8},{48:8},{56:8}")
    !success = SETENVQQ("MKL_DYNAMIC=FALSE")
    
!$OMP PARALLEL DEFAULT(SHARED) 
!$OMP END PARALLEL

    !Number of repeats for the matrix product
    Nrepeats=50  
    !matrix dimensions
    NoRows=759660
    NoCols=759660
    NoElements=14081052
    
    allocate(MyVals(NoElements))
    allocate(MyCols(NoElements))
    allocate(MyRows(NoRows+1))
    allocate(MyVals2(NoElements))
    allocate(MyCols2(NoElements))
    allocate(MyRows2(NoRows+1))
    allocate(MyRows3(NoRows+1))
    
    RuntimeBegin=omp_get_wtime()
    
    !Row indices
    open(unit=F,file='c:\temp\004_rowIdxs.txt',status='unknown')
    read(F,'(A)') S
    do i=1,NoRows+1
      read(F,'(A)') S
      read(S,*) MyRows(i)
    end do
    close(F)
    
    !Col indices
    open(unit=F,file='c:\temp\004_colIdxs.txt',status='unknown')
    read(F,'(A)') S
    do i=1,NoElements
      read(F,'(A)') S
      read(S,*) MyCols(i)
    end do
    close(F)
    
    !Vals indices
    open(unit=F,file='c:\temp\004_values.txt',status='unknown')
    read(F,'(A)') S
    do i=1,NoElements
      read(F,'(A)') S
      read(S,*) MyVals(i)
    end do
    close(F)
    RuntimeEnd=omp_get_wtime()
    print *,'Matrix loading time ',RuntimeEnd-RuntimeBegin
  
    !Copy the matrix into the corresponding vectors for matrix 2
    MyVals2(:)=MyVals(:)
    MyCols2(:)=MyCols(:)
    MyRows2(:)=MyRows(:)
       
    
    !Now we run the matrix product on 1 node
    call omp_set_num_threads(1)
    RuntimeBegin=omp_get_wtime()
    do i=1,nrepeats
      call mkl_dcsrmultcsr('n', 1, 7,NoRows,NoCols, NoCols, MyVals, MyCols, MyRows, MyVals2, MyCols2, MyRows2, MyVals3, MyCols3, MyRows3, NoElements, info)
    end do
    CNoElements=MyRows3(NoRows+1)-1
    !Now allocate the output matrix!
    allocate(MyVals3(CNoElements))
    allocate(MyCols3(CNoElements))
    
    do i=1,nrepeats
      call mkl_dcsrmultcsr('n', 2, 7,NoRows,NoCols, NoCols, MyVals, MyCols, MyRows, MyVals2, MyCols2, MyRows2, MyVals3, MyCols3, MyRows3, NoElements, info)
    end do
    RuntimeEnd=omp_get_wtime()
    print *,'Runtime 1 thread, no dist,GFLOPS:',RuntimeEnd-RuntimeBegin,0.92/(RuntimeEnd-RuntimeBegin) !We know that the matrix product takes around 920M Flops.
    
    if(allocated(MyVals3)) deallocate(MyVals3)
    if(allocated(MyCols3)) deallocate(MyCols3)

    !We want to limit the affinity to one Numa-node now.
    success=SETENVQQ("OMP_PLACES={0:8}")
   
    call omp_set_num_threads(7)
    !$OMP PARALLEL DEFAULT(SHARED)
    !$OMP END PARALLEL

    
    RuntimeBegin=omp_get_wtime()
    do i=1,nrepeats
      call mkl_dcsrmultcsr('n', 1, 7,NoRows,NoCols, NoCols, MyVals, MyCols, MyRows, MyVals2, MyCols2, MyRows2, MyVals3, MyCols3, MyRows3, NoElements, info)
    end do
    CNoElements=MyRows3(NoRows+1)-1
    !Now allocate the output matrix!
    allocate(MyVals3(CNoElements))
    allocate(MyCols3(CNoElements))
    
    do i=1,nrepeats
      call mkl_dcsrmultcsr('n', 2, 7,NoRows,NoCols, NoCols, MyVals, MyCols, MyRows, MyVals2, MyCols2, MyRows2, MyVals3, MyCols3, MyRows3, NoElements, info)
    end do
    RuntimeEnd=omp_get_wtime()
    print *,'Runtime 7 thread, no dist,GFLOPS:',RuntimeEnd-RuntimeBegin,0.92/(RuntimeEnd-RuntimeBegin)

    !We now want to spread the matrix across the memory, in order to gain optimal performance on the numa system
    !We reallocate the first matrix, in order to utilize first touch principle    
    if(allocated(MyVals)) deallocate(MyVals)
    if(allocated(MyCols)) deallocate(MyCols)
    if(allocated(Myrows)) deallocate(Myrows)
    allocate(MyVals(NoElements))
    allocate(MyCols(NoElements))
    allocate(MyRows(NoRows+1))
    
    !Now we distribute the matrices before.
    success=SETENVQQ("OMP_PLACES={0:8},{8:8},{16:8},{24:8},{32:8},{40:8},{48:8},{56:8}")
    call omp_set_num_threads(8)
    
!$OMP PARALLEL DEFAULT(SHARED) PRIVATE(i,First,Last) 
!$OMP DO SCHEDULE(STATIC)     
      do i=1,NoRows 
        First=MyRows2(i)
        Last=MyRows2(i+1)-1
        MyVals(First:Last)=MyVals2(First:Last)
        MyCols(First:Last)=MyCols2(First:Last)
        MyRows(i)=MyRows2(i) 
      end do
!$OMP END DO
!$OMP END PARALLEL

    !Now we do the same for Matrix 2
    if(allocated(MyVals2)) deallocate(MyVals2)
    if(allocated(MyCols2)) deallocate(MyCols2)
    if(allocated(Myrows2)) deallocate(Myrows2)
    allocate(MyVals2(NoElements))
    allocate(MyCols2(NoElements))
    allocate(MyRows2(NoRows+1))
    
    !Now we distribute the matrices before.
!$OMP PARALLEL DEFAULT(SHARED) PRIVATE(i,First,Last) 
!$OMP DO SCHEDULE(STATIC)     
      do i=1,NoRows 
        First=MyRows(i)
        Last=MyRows(i+1)-1
        MyVals2(First:Last)=MyVals(First:Last)
        MyCols2(First:Last)=MyCols(First:Last)
        MyRows2(i)=MyRows(i)
      end do
!$OMP END DO
!$OMP END PARALLEL
      
      

    
    if(allocated(MyRows3)) deallocate(MyRows3)
    if(allocated(MyVals3)) deallocate(MyVals3)
    if(allocated(MyCols3)) deallocate(MyCols3)
    allocate(MyRows3(NoRows+1))
    
    call omp_set_num_threads(56)
    RuntimeBegin=omp_get_wtime()
    do i=1,nrepeats
      call mkl_dcsrmultcsr('n', 1, 7,NoRows,NoCols, NoCols, MyVals, MyCols, MyRows, MyVals2, MyCols2, MyRows2, MyVals3, MyCols3, MyRows3, NoElements, info)
    end do
    CNoElements=MyRows3(NoRows+1)-1
    !Now allocate the output matrix!
    allocate(MyVals3(CNoElements))
    allocate(MyCols3(CNoElements))
    
    do i=1,nrepeats
      call mkl_dcsrmultcsr('n', 2, 7,NoRows,NoCols, NoCols, MyVals, MyCols, MyRows, MyVals2, MyCols2, MyRows2, MyVals3, MyCols3, MyRows3, NoElements, info)
    end do
    RuntimeEnd=omp_get_wtime()
    print *,'Runtime 56 thread, dist,GFLOPS:',RuntimeEnd-RuntimeBegin,0.92/(RuntimeEnd-RuntimeBegin)
    end program SparseMatrixProduct

 

Tue_B_
Novice

I have now also made a test of pardiso using the same parallel setup. But for pardiso the above clearly does not work: not only does the affinity not get respected - threads get placed across all NUMA nodes even when I try to restrict them to only a few - but even worse, they do not seem to respect the number of MKL threads I restrict them to. The code below should use less than 25% of my CPU power (I have 64 logical CPUs available), yet I consistently see the pardiso solver use upwards of 70% of my CPUs.

Can anyone confirm this behavior, and perhaps tell me what I am doing wrong?

Below is the pardiso output. I call pardiso twice, and it is worth noting the huge amount of time it spends allocating internal structures the first time around - what exactly is this, and why does it take this long?

Another thing worth mentioning is that for me the reordering takes 85% of the total time, which seems like a lot. 

Percentage of computed non-zeros for LL^T factorization: 0 to 100 (progress counter, printed one percent per line)


=== PARDISO: solving a real nonsymmetric system ===
1-based array indexing is turned ON
PARDISO double precision computation is turned ON
METIS algorithm at reorder step is turned ON
Scaling is turned ON
Matching is turned ON
Single-level factorization algorithm is turned ON


Summary: ( starting phase is reordering, ending phase is solution )
================

Times:
======
Time spent in calculations of symmetric matrix portrait (fulladj): 0.632378 s
Time spent in reordering of the initial matrix (reorder)         : 369.483572 s
Time spent in symbolic factorization (symbfct)                   : 1.168712 s
Time spent in data preparations for factorization (parlist)      : 0.139124 s
Time spent in copying matrix to internal data structure (A to LU): 0.000000 s
Time spent in factorization step (numfct)                        : 66.482128 s
Time spent in direct solver at solve step (solve)                : 0.513812 s
Time spent in allocation of internal data structures (malloc)    : 642.525975 s
Time spent in additional calculations                            : 1.243984 s
Total time spent                                                 : 1082.189686 s


Statistics:
===========
Parallel Direct Factorization is running on 4 OpenMP

< Linear system Ax = b >
             number of equations:           759660
             number of non-zeros in A:      14081052
             number of non-zeros in A (%): 0.002440

             number of right-hand sides:    1

< Factors L and U >
             number of columns for each panel: 96
             number of independent subgraphs:  0
< Preprocessing with state of the art partitioning metis>
             number of supernodes:                    543609
             size of largest supernode:               1590
             number of non-zeros in L:                165815464
             number of non-zeros in U:                152153734
             number of non-zeros in L+U:              317969198
             gflop   for the numerical factorization: 255.499593

             gflop/s for the numerical factorization: 3.843132





Percentage of computed non-zeros for LL^T factorization: 0 to 100 (progress counter, printed one percent per line)


=== PARDISO: solving a real nonsymmetric system ===
1-based array indexing is turned ON
PARDISO double precision computation is turned ON
METIS algorithm at reorder step is turned ON
Scaling is turned ON
Matching is turned ON
Single-level factorization algorithm is turned ON


Summary: ( starting phase is reordering, ending phase is solution )
================

Times:
======
Time spent in calculations of symmetric matrix portrait (fulladj): 0.470824 s
Time spent in reordering of the initial matrix (reorder)         : 372.948610 s
Time spent in symbolic factorization (symbfct)                   : 1.373649 s
Time spent in data preparations for factorization (parlist)      : 0.137292 s
Time spent in copying matrix to internal data structure (A to LU): 0.000000 s
Time spent in factorization step (numfct)                        : 55.114394 s
Time spent in direct solver at solve step (solve)                : 0.442787 s
Time spent in allocation of internal data structures (malloc)    : 1.351361 s
Time spent in additional calculations                            : 1.178430 s
Total time spent                                                 : 433.017346 s

Statistics:
===========
Parallel Direct Factorization is running on 4 OpenMP

< Linear system Ax = b >
             number of equations:           759660
             number of non-zeros in A:      14081052
             number of non-zeros in A (%): 0.002440

             number of right-hand sides:    1

< Factors L and U >
             number of columns for each panel: 96
             number of independent subgraphs:  0
< Preprocessing with state of the art partitioning metis>
             number of supernodes:                    543609
             size of largest supernode:               1590
             number of non-zeros in L:                165815464
             number of non-zeros in U:                152153734
             number of non-zeros in L+U:              317969198
             gflop   for the numerical factorization: 255.499593

             gflop/s for the numerical factorization: 4.635805

 Pardiso time   876.761621378828

 

Below you will find the code I used to generate the above.

(The attached matrix needs to be copied to c:\temp\, or the input directory in the code needs to be changed.)

PROGRAM PardisoTest

    use mkl_service
    use MKL_PARDISO
    use IFPORT
    use omp_lib
   implicit none

   logical(4) :: Success
   integer :: numaNodeCount,processorPackageCount,processorCoreCount,logicalProcessorCount
   integer,dimension(3) :: processorCacheCount   
   integer :: NoNUMANodes, blocksize,nrepeats,Runmode,t0
   integer :: N,I,J,NIte, First,Last,k,colidx,error,numofblocks,iii,ii,dim,d,threadID,NumaID
   integer :: Iter,Solver,NUMASize,m,ThreadsPrNuma,noRows
   integer, allocatable,dimension(:) :: GlobalThreadID
   real*8,allocatable,dimension(:,:) :: A, B,C,rA,rB,rC,cA,cB,cC,bA,bB,bbc
   real*8,allocatable,target :: bC(:,:)
   logical, allocatable, dimension(:) :: NumaNodeDone,MKlbusy
   real*8 :: Runtimeend, runtimebegin,Flops,Runtime
   real*8,pointer,dimension(:) ::vals,rhs,x,MyVals,MyRHS
   integer,pointer,dimension(:) ::row,col, myRows,myCols
   integer :: numofblocksprnuma, last_row,last_col,first_row,first_col,NumOfBlocksPrRow,MyNoElements,MyNoRows
    integer,parameter               :: F=99
    character*256                   :: S   
     
  
  
  TYPE(MKL_PARDISO_HANDLE), pointer :: PT(:) => null()
  integer, pointer :: idum(:) => null()
  real(KIND=8), pointer :: ddum(:) => null()
  INTEGER maxfct, mnum, mtype, phase, nrhs, msglvl
  INTEGER iparm(64)
    real*8,pointer,dimension(:) :: myoV
    real*8 :: error1
  

   call GetMachineArch(numaNodeCount,processorPackageCount,d,logicalProcessorCount,processorCacheCount ) 
  
   
   NoNUMANodes=3                     !How many NUMA nodes to distribute calculations over
   ThreadsPrNuma=4                   !How many threads to use pr. numa node 
   NRepeats = 2
   SELECT CASE (NoNUMANodes) !In this case we have 8 logical processors on each NUMA node.
   CASE(1)
   success=SETENVQQ("OMP_PLACES={0:8}") 
   CASE(2)
   success=SETENVQQ("OMP_PLACES={0:8},{8:8}")
   CASE(3)
   success=SETENVQQ("OMP_PLACES={0:8},{8:8},{16:8}")
   CASE(4)
   success=SETENVQQ("OMP_PLACES={0:8},{8:8},{16:8},{24:8}")
   CASE(5)
   success=SETENVQQ("OMP_PLACES={0:8},{8:8},{16:8},{24:8},{32:8}")
   CASE(6)
   success=SETENVQQ("OMP_PLACES={0:8},{8:8},{16:8},{24:8},{32:8},{40:8}")
   CASE(7)
   success=SETENVQQ("OMP_PLACES={0:8},{8:8},{16:8},{24:8},{32:8},{40:8},{48:8}")
   CASE(8)
   success=SETENVQQ("OMP_PLACES={0:8},{8:8},{16:8},{24:8},{32:8},{40:8},{48:8},{56:8}")
   END SELECT       
   
   allocate(GlobalThreadID(NoNumanodes+1))
   call KMP_SET_STACKSIZE_S(990000000)
   call omp_set_dynamic(0)
   call mkl_set_dynamic(0)
   call omp_set_nested(1)
  
  allocate(A(200,200)) 
  allocate(B(200,200)) 
  allocate(C(200,200)) 
  A=0
  B=1
  C=0
 
     !initialization region
     call omp_set_num_threads(NoNumaNodes+1)
     !$OMP PARALLEL DEFAULT(SHARED) PRIVATE(i,ii,j) 
     !$OMP DO SCHEDULE(STATIC)
     do i = 1,NoNumanodes+1
        GlobalThreadID(i)=omp_get_thread_num()
        print *,'Thread binding for socket=',GlobalThreadID(i)
        SELECT CASE (GlobalThreadID(i))
        CASE(1)
        success=SETENVQQ("OMP_PLACES={0:8}")        
        CASE(2)
        success=SETENVQQ("OMP_PLACES={8:8}")
        CASE(3)
        success=SETENVQQ("OMP_PLACES={16:8}")
        CASE(4)
        success=SETENVQQ("OMP_PLACES={24:8}")
        CASE(5)
        success=SETENVQQ("OMP_PLACES={32:8}")
        CASE(6)
        success=SETENVQQ("OMP_PLACES={40:8}")
        CASE(7)
        success=SETENVQQ("OMP_PLACES={48:8}")
        CASE(8)
        success=SETENVQQ("OMP_PLACES={56:8}")
        END SELECT 

        if(GlobalThreadID(i).gt.0) then
            !Spawn the MKL_threadpool
            ii=mkl_set_num_threads_local(ThreadsPrNuma)
            call dgemm('N','N',2,2,2,1.d0,A,2,B,2,0.d0,C,2)
        end if
     end do
     !$OMP END DO
     !$OMP END PARALLEL 
     
     !Now we load the matrix we want to use pardiso on
    RuntimeBegin = omp_get_wtime()
    !definition file
    open(unit=F,file='c:\temp\001_definition.txt',status='unknown')
    read(F,'(A)') S
    read(F,'(A)') S
    read(S,*) MyNoRows, MyNoRows, MyNoElements, i
    close(F)
    allocate(MyRows(MyNoRows+1))
    allocate(MyCols(MyNoElements))
    allocate(MyVals(MyNoElements))
    allocate(MyRHS(MyNoRows))
    print*,'NoRows',MyNoRows
    print*,'NoElements',MyNoElements

    
    !Row indices
    open(unit=F,file='c:\temp\001_rowIdxs.txt',status='unknown')
    read(F,'(A)') S
    do i=1,MyNoRows+1
      read(F,'(A)') S
      read(S,*) MyRows(i)
    end do
    close(F)
    
    !Col indices
    open(unit=F,file='c:\temp\001_colIdxs.txt',status='unknown')
    read(F,'(A)') S
    do i=1,MyNoElements
      read(F,'(A)') S
      read(S,*) MyCols(i)
    end do
    close(F)
    
    !Vals indices
    open(unit=F,file='c:\temp\001_values.txt',status='unknown')
    read(F,'(A)') S
    do i=1,MyNoElements
      read(F,'(A)') S
      read(S,*) MyVals(i)
    end do
    close(F)

    !RHS indices
    open(unit=F,file='c:\temp\001_RHS.txt',status='unknown')
    read(F,'(A)') S
    do i=1,MyNoRows
      read(F,'(A)') S
      read(S,*) MyRHS(i)
    end do
    close(F)
    RuntimeEnd=omp_get_wtime()
    print *,'Matrix loading time ',RuntimeEnd-RuntimeBegin
  
     
     Print*,'Now the pardiso part'
     call omp_set_num_threads(NoNumaNodes+1)
     RuntimeBegin = omp_get_wtime()
     !$OMP PARALLEL DEFAULT(SHARED) PRIVATE(i,ii,threadID,j,row,col,vals,rhs,x,PT,maxfct,mnum,mtype,phase,NoRows,idum,nrhs,iparm,msglvl,error)     
     !$OMP DO SCHEDULE(STATIC)
     do i = 1,NoNumanodes+1
        threadID=omp_get_thread_num()
        if((threadID.gt.0)) then
         allocate(row(MyNoRows+1))
         allocate(col(MyNoElements))
         allocate(vals(MyNoElements))
         allocate(rhs(MyNoRows))
         allocate(x(MyNoRows))
         allocate( PT(64), idum(0))
         row=MyRows
         col=MyCols
         vals=MyVals
         NoRows=MyNoRows
         rhs=MyRHS
         nrhs= 1
         maxfct =1
         mnum =1
         iparm = 0         
         error = 0 
         if((threadID.eq.1)) then
           msglvl = 1 
         else 
           msglvl = 0
         end if
         
         mtype = 11   
   
         do j = 1, 64
           PT(j)%DUMMY = 0
         end do
         phase = 13       
        do j=1,Nrepeats
          CALL pardiso (PT, maxfct, mnum, mtype, phase, NoRows, Vals,Row,Col, idum, nrhs, iparm, msglvl, rhs,x, error)
         end do
        end if 
     end do
     !$OMP END DO
     !$OMP END PARALLEL
     RuntimeEnd = omp_get_wtime()
     print*,'Pardiso time',RuntimeEnd-RunTimebegin
     
    END PROGRAM PardisoTest
       
 
    
    
   subroutine GetMachineArch(numaNodeCount,processorPackageCount,processorCoreCount,logicalProcessorCount,processorCacheCount )
   
   use, intrinsic :: ISO_C_BINDING
   use kernel32
   implicit none   
   integer,intent(out) :: numaNodeCount,processorPackageCount,processorCoreCount,logicalProcessorCount
   integer,dimension(3),intent(out) :: processorCacheCount     
   ! Variables
   procedure(GetLogicalProcessorInformation), pointer :: glpi
   type(T_SYSTEM_LOGICAL_PROCESSOR_INFORMATION), allocatable, dimension(:) :: buffer
   integer(DWORD) :: returnLength = 0
   integer(DWORD) :: ret
   integer :: nlpi, lpi_element_length, i 
    
    processorCacheCount=0
    numaNodeCount = 0
    processorCoreCount = 0
    logicalProcessorCount = 0
    processorPackageCount = 0
    
       call c_f_procpointer( &
       transfer( &
           GetProcAddress( &
               GetModuleHandle("kernel32"//C_NULL_CHAR), &
               "GetLogicalProcessorInformation"//C_NULL_CHAR &
               ), &
           C_NULL_FUNPTR &
           ), &
       glpi)
   
      if (.not. associated(glpi)) then
       print *,"GetLogicalProcessorInformation not supported"
       error stop
      end if
    
   ! We don't know in advance the size of the buffer we need. We'll pick a number, allocate it,
   ! and see if that's sufficient.  If not, we'll use the returned size information and reallocate
   ! the buffer to the required size.
   
   allocate (buffer(20))
   lpi_element_length = C_SIZEOF(buffer(1))
   returnLength = C_SIZEOF(buffer)
   ret = glpi(buffer, returnLength)
   if (ret == FALSE) then ! Failed
       if (GetLastError() == ERROR_INSUFFICIENT_BUFFER) then
           deallocate (buffer)
           allocate (buffer(returnLength/lpi_element_length))
           ret = glpi(buffer, returnLength)
           if (ret == FALSE) then
               print *,"GetLogicalProcessorInformation call failed with error code ", GetLastError()
               error stop
           end if
       else
           print *, "GetLogicalProcessorInformation call failed with error code ", GetLastError()
           error stop
       end if
   end if
   

   ! Now we can iterate through the elements of buffer and see what we can see
   
   do i=1, returnLength / lpi_element_length ! Number of elements in buffer
       select case (buffer(i)%Relationship)
       case(RelationNumaNode)
           ! NUMA nodes return one record of this type
           numaNodeCount = numaNodeCount + 1
       
       case(RelationProcessorCore)
           processorCoreCount = processorCoreCount + 1
           
           ! A Hyperthreaded core supplies more than one logical processor
           logicalProcessorCount = logicalProcessorCount + popcnt(buffer(i)%processorMask)
           
       case(RelationCache)
           ! One cache descriptor for each cache
           if (buffer(i)%Cache%Level > 0 .and. buffer(i)%Cache%Level <= 3) then
               processorCacheCount(buffer(i)%Cache%Level) = processorCacheCount(buffer(i)%Cache%Level) + 1
           else
               print *,"Invalid processor cache level ", buffer(i)%Cache%Level
           end if
           
       case(RelationProcessorPackage)
           !Logical processors share a physical package (socket)
           processorPackageCount = processorPackageCount + 1
           
       case default
           print *, "Unrecognized relationship code ", buffer(i)%Relationship
           
       end select
   end do
   
   if(allocated(buffer)) deallocate(buffer)

   
   print*,"GetLogicalProcessorInformation results:"
   print*, "  Number of NUMA nodes: ", numaNodeCount
   print*, "  Number of physical processor packages: ", processorPackageCount
  print*, "  Number of processor cores: ", processorCoreCount
 print*, "  Number of logical processors: ", logicalProcessorCount
print*, "  Number of processor L1/L2/L3 caches: ",processorCacheCount    

    end subroutine GetMachineArch
   
 

 

 

 

 

Ying_H_Intel
Employee

Hello,

Hope this doesn't make the situation worse. We will investigate the reordering time issue and get back later (it seems to be running with 1 thread during that phase, hence so slow).

Let me add some information regarding the affinity problem and the original question, "How to set affinity of threads spawned by MKL?"

We did some investigation on this, and the general message is in the PDF file of https://software.intel.com/en-us/articles/using-threaded-intel-mkl-in-multi-thread-application :

and some discussions also in

https://software.intel.com/en-us/articles/recommended-settings-for-calling-intelr-mkl-routines-from-multi-threaded-applications

https://software.intel.com/en-us/articles/intel-math-kernel-library-intel-mkl-using-intel-mkl-with-threaded-applications

In that sample, we use

pthreads as the external-level threads (in your case, you use OpenMP as the external level, which may mix up the external threads and the internal threads)

and MKL threads as the internal level.

Basically, the MKL internal-level threads do not follow the external-level thread affinity, so you can't control the affinity of the MKL internal threads by setting the affinity of the external threads. So in your last example, the MKL pardiso threads don't respect your NUMA node control the way the external threads did.

Though the OpenMP threads and pthreads don't know about each other, we can still control the OpenMP threads' affinity by inserting an affinity-setting call at the point where the MKL threads are spawned. So it is still possible to control MKL threading in your case, for example:

1) Don't control the affinity at all and just set the MKL thread number; let the OS handle placement. What does the CPU usage look like then?

2) Control the MKL thread affinity carefully (rather than controlling the external-level OpenMP threads), for example as jimdempseyatthecove suggested:

Add the affinity control after mkl_set_num_threads_local(); that is, move ii=mkl_set_num_threads_local(3) before the affinity control.

 

   allocate(GlobalThreadID(NoNumanodes+1))
   call KMP_SET_STACKSIZE_S(990000000)
   call omp_set_dynamic(0)
   call mkl_set_dynamic(0)
   call omp_set_nested(1)
   call omp_set_max_active_levels(2)

  call omp_set_num_threads(NoNumaNodes+1)
     !$OMP PARALLEL DEFAULT(SHARED) PRIVATE(i,ii,j)
     !$OMP DO SCHEDULE(STATIC)
     do i = 1,NoNumanodes+1

      ii=mkl_set_num_threads_local(3)
      ! Control affinity here, as in the article above.
      ! The control below is for the external OpenMP threads; remove it or
      ! change it so that it applies to the internal MKL threads instead:
      !   GlobalThreadID(i)=omp_get_thread_num()
      !   print *,'Thread binding for socket=',GlobalThreadID(i)
      !   SELECT CASE (GlobalThreadID(i))
      !   CASE(1)
      !   success=SETENVQQ("OMP_PLACES={0:8}")

         do j = 1, 64
           PT(j)%DUMMY = 0
         end do
         phase = 13      
         do j=1,Nrepeats
          CALL pardiso (PT, maxfct, mnum, mtype, phase, NoRows, Vals,Row,Col, idum, nrhs, iparm, msglvl, rhs,x, error)
         end do

Best Regards,

Ying

Tue_B_
Novice

Hey Ying

Thank you for the comment!

Especially the part about putting ii=mkl_set_num_threads_local before the affinity settings - that seemed to do the trick. Now it is just a matter of the reordering and initialization times, as previously mentioned.
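For the record, the pattern that ends up working looks roughly like this (a sketch reusing names from the PardisoTest program above; the per-thread allocation of PT, row, col, vals, rhs and x is omitted here, see the full program earlier):

     call omp_set_num_threads(NoNumaNodes+1)
     !$OMP PARALLEL DEFAULT(SHARED) PRIVATE(i,ii,threadID)
     !$OMP DO SCHEDULE(STATIC)
     do i = 1,NoNumanodes+1
        threadID=omp_get_thread_num()
        if(threadID.gt.0) then
            ! 1) set the local MKL thread count BEFORE touching affinity,
            !    so the pool this thread spawns picks up the places below
            ii=mkl_set_num_threads_local(ThreadsPrNuma)
            ! 2) restrict the places for this thread's MKL pool (one NUMA node)
            SELECT CASE (threadID)
            CASE(1)
            success=SETENVQQ("OMP_PLACES={0:8}")
            CASE(2)
            success=SETENVQQ("OMP_PLACES={8:8}")
            CASE(3)
            success=SETENVQQ("OMP_PLACES={16:8}")
            END SELECT
            ! 3) the MKL call now spawns its threads inside those places
            CALL pardiso (PT, maxfct, mnum, mtype, phase, NoRows, Vals, Row, Col, idum, nrhs, iparm, msglvl, rhs, x, error)
        end if
     end do
     !$OMP END DO
     !$OMP END PARALLEL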

Cheers

Tue
