- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I'm trying to do some optimization on a NUMA system at the moment and have run into trouble with regards to thread affinity.
The essence of the code I'm using can be seen below:
success = SETENVQQ("OMP_PLACES={0,1,2,3,4,5,6,7},{8,9,10,11,12,13,14,15},{16,17,18,19,20,21,22,23},{24,25,26,27,28,29,30,31},{32,33,34,35,36,37,38,39},{40,41,42,43,44,45,46,47},{48,49,50,51,52,53,54,55},{56,57,58,59,60,61,62,63}") NUMA=4 ThreadsPrNuma=2 call omp_set_dynamic(0) call mkl_set_dynamic(0) call omp_set_nested(1) call omp_set_num_threads(NUMA) call mkl_set_num_threads(NUMA) !$OMP PARALLEL DEFAULT(SHARED) PRIVATE(k) proc_bind(spread) !$OMP DO SCHEDULE(STATIC) do k=1,NUMA !$OMP PARALLEL DEFAULT(SHARED) PRIVATE(i) proc_bind(master) !$OMP END PARALLEL call omp_set_num_threads(ThreadsPrNuma) call mkl_set_num_threads(ThreadsPrNuma) call MKL_routine_to_run_parallel_on_each_NUMA_node() end do !$OMP END DO !$OMP END PARALLEL
So the idea is that I have I want to do an outer paralization over each NUMA node - proc_bind(spread).
And then I want each of these threads to have an inner paralization over the number of processors in that NUMA node, the problem is that the MKL routine doesn't seem to be respecting the proc_bind(master) I set.
Anyone got any idea how to do such a thing?
Cheers
Tue
Link Copied
- « Previous
-
- 1
- 2
- Next »
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hey Jim, what you are suggesting is exactly what I have been doing in the last couple of hours and here is the result.
It works!
For some unknown reason a mkl pool on the master thread does not respect the affinity setting, but they do on all the other threads.
Here is a code example of something running with correct affinity even after the initialization.
Once I get a complete numa-aware dgemm matrix multiplication up and running I will post it along with some performance results.
Cheers!
program NumaAwareDGEMM use IFPORT use omp_lib use mkl_service implicit none logical(4) :: Success integer :: numaNodeCount,processorPackageCount,processorCoreCount,logicalProcessorCount integer,dimension(3) :: processorCacheCount integer :: NoNUMANodes, blocksize,nrepeats,Runmode,t0 integer :: N,I,J,NIte, First,Last,k,colidx,error,numofblocks,iii,ii,dim,d,threadID,NumaID integer :: Iter,Solver,NUMASize,m,ThreadsPrNuma integer, allocatable,dimension(:) :: GlobalThreadID real*8,allocatable,dimension(:,:) :: A, B,C,rA,rB,rC,cA,cB,cC,bA,bB,bbc real*8,allocatable,target :: bC(:,:) logical, allocatable, dimension(:) :: NumaNodeDone,MKlbusy !Set the mode to test call GetMachineArch(numaNodeCount,processorPackageCount,d,logicalProcessorCount,processorCacheCount ) NoNUMANodes=9 !How many NUMA nodes to distribute calculations over SELECT CASE (NoNUMANodes) CASE(1) success=SETENVQQ("OMP_PLACES={0:8}") CASE(2) success=SETENVQQ("OMP_PLACES={0:8},{8:8}") CASE(3) success=SETENVQQ("OMP_PLACES={0:8},{8:8},{16:8}") CASE(4) success=SETENVQQ("OMP_PLACES={0:8},{8:8},{16:8},{24:8}") CASE(5) success=SETENVQQ("OMP_PLACES={0:8},{8:8},{16:8},{24:8},{32:8}") CASE(6) success=SETENVQQ("OMP_PLACES={0:8},{8:8},{16:8},{24:8},{32:8},{40:8}") CASE(7) success=SETENVQQ("OMP_PLACES={0:8},{8:8},{16:8},{24:8},{32:8},{40:8},{48:8}") CASE(8) success=SETENVQQ("OMP_PLACES={0:8},{8:8},{16:8},{24:8},{32:8},{40:8},{48:8},{56:8}") CASE(9) success=SETENVQQ("OMP_PLACES={0:8},{8:8},{16:8},{24:8},{32:8},{40:8},{48:8},{56:8}") END SELECT !Create dummy matrices - we just matmul them uninitialized blocksize=6000 ThreadsPrNuma=4 !How many threads to use pr. numa node dim=blocksize*NoNUMANodes allocate(bA(dim,dim)) allocate(bB(dim,dim)) allocate(bC(dim,dim)) allocate(bbc(blocksize,blocksize)) allocate(GlobalThreadID(NoNumanodes*ThreadsPrNuma)) allocate(Mklbusy(NoNumanodes)) call KMP_SET_STACKSIZE_S(990000000) call omp_set_dynamic(0) call mkl_set_dynamic(0) call omp_set_nested(1) !intialization region call omp_set_num_threads(NoNumaNodes) !$OMP PARALLEL DEFAULT(SHARED) PRIVATE(i,ii,threadID,NumaID) !$OMP DO SCHEDULE(STATIC) do i = 1,NoNumanodes GlobalThreadID(i)=omp_get_thread_num() print *,'Thread binding for socket=',GlobalThreadID(i) SELECT CASE (GlobalThreadID(i)) CASE(1) success=SETENVQQ("OMP_PLACES={0:8}") CASE(2) success=SETENVQQ("OMP_PLACES={8:8}") CASE(3) success=SETENVQQ("OMP_PLACES={16:8}") CASE(4) success=SETENVQQ("OMP_PLACES={24:8}") CASE(5) success=SETENVQQ("OMP_PLACES={32:8}") CASE(6) success=SETENVQQ("OMP_PLACES={40:8}") CASE(7) success=SETENVQQ("OMP_PLACES={48:8}") CASE(8) success=SETENVQQ("OMP_PLACES={56:8}") END SELECT if(GlobalThreadID(i).gt.0) then call KMP_SET_BLOCKTIME(0) ii=mkl_set_num_threads_local(ThreadsPrNuma) call dgemm('N','N',blocksize,blocksize,blocksize,1.d0,bA,dim,bB,dim,0.d0,bbc,blocksize) end if end do !$OMP END DO !$OMP END PARALLEL MKLbusy = .true. ! !$OMP PARALLEL DEFAULT(SHARED) PRIVATE(i,ii,threadID,NumaID) ! all threads scheduled !$OMP DO SCHEDULE(STATIC) do i = 1,NoNumanodes threadID=omp_get_thread_num() print *,'Thread binding for socket=',GlobalThreadID(i),threadID if((threadID.gt.0)) then ! calculate tile for thread 0 call dgemm('N','N',blocksize,blocksize,blocksize,1.d0,bA,dim,bB,dim,0.d0,bbc,blocksize) end if end do !$OMP END DO !$OMP END PARALLEL end program NumaAwareDGEMM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Tue,
You (we) managed to do something that a lot of other people had attempted to do, but for lack of documentation (or understanding on the internals of the relationship of MKL and OpemMP on the host program).
Steve L., If you've been following this thread, see if you can get Tue B.'s sample program above into the MKL User and Reference manual, with a good explanation of why this works and the nuances of using it.
Steve, Tue
Your example code above still needs a little more work.
Your line 93 !$OMP DO SCHEDULE(STATIC) will look at the following "do i=1,NoNumaNodes" and create a team of NoNumaNodes threads (0, 1, ... NoNumaNodes-1). This includes the process main thread as team member number 0, which you exclude with your "if(hreadID.gt.0))", but this is also excluding the parallel region's team member number NoNumaNodes which is now processing Numa Node 0's slice of the work.
MKLbusy = .true. ! !$OMP PARALLEL DEFAULT(SHARED) PRIVATE(i,ii,threadID,NumaID) ! all threads scheduled !$OMP DO SCHEDULE(STATIC) do i = 1,NoNumanodes+1 threadID=omp_get_thread_num() if((threadID.gt.0)) then ! calculate tile for Numa Node threadID-1 call dgemm('N','N',blocksize,blocksize,blocksize,1.d0,bA,dim,bB,dim,0.d0,bbc,blocksize) end if end do !$OMP END DO !$OMP END PARALLEL
I strongly suggest for the example code to go into the MKL reference guide, that the above sample is reworked to actually partition the working array. A version along the line of:
module Partition_module type Partition_t ! sufficient number of variables to specify ! the entire workspace (outside of parallel region) ! as well as a subsection for use by thread inside parallel region ... end type Partition_t ... end module Partition_module use Partition_module type(Partition_t) :: masterPartiton type(Partition_t), allocatable :: NumaNodePartition(:) ! allocated 1:NoNumaNodes ! initialize MKL pools ... ! Initialize masterPartition ... allocate(NumaNodePartition(NoNumaNodes)) ! 1:NoNumaNodes ! Using masterPartiton and NoNumaNodes, initialize each node partiton ... ! use code MKLbusy = .true. ! !$OMP PARALLEL DEFAULT(SHARED) PRIVATE(i,ii,threadID,NumaID) ! all threads scheduled threadID=omp_get_thread_num() if((threadID.gt.0).and.(threadID.le.NoNumaNodes) then call YourShell_dgemm(NumaNodePartition(threadID)) MKLbusy(threadID) = .false. else do while(any(MKLbusy)) SleepQQ(0) end do endif !$OMP END PARALLEL ... subroutine YourShell_dgemm(Partition) use Partition_module type(Partition_t) :: Partiton ... call dgemm('N','N', & & Partition%blocksize, & & Partition%blocksize, & & Partition%blocksize, & & 1.d0, & & Partition%bA, & & Partition%dim, & & Partition%bB, & & Partition%dim, & & 0.d0, & & Partition%bbc, & & Partition%blocksize) end subroutine YourShell_dgemm
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I am going to move this to the MKL forum so the MKL experts there can look at it.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I did send a request to the MKL documentation writers to look at this thread to see what they could use from it.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
jimdempseyatthecove wrote:
Your line 93 !$OMP DO SCHEDULE(STATIC) will look at the following "do i=1,NoNumaNodes" and create a team of NoNumaNodes threads (0, 1, ... NoNumaNodes-1). This includes the process main thread as team member number 0, which you exclude with your "if(hreadID.gt.0))", but this is also excluding the parallel region's team member number NoNumaNodes which is now processing Numa Node 0's slice of the work.
I'm not entirely sure I understand you correctly here, but what I think you are saying is the following:
If I do
do i = 1,NoNumanodes+1
threadID=omp_get_thread_num()
if((threadID.gt.0)) then
call dgemm('N','N',blocksize,blocksize,blocksize,1.d0,bA,dim,bB,dim,0.d0,bbc,blocksize)
end if
end do
then you are afraid that if((threadID.gt.0)) will not only exclude the master thread but also the last thread, which is meant to do the work instead of the master thread. Is that correct?
Cause if it is then that is not the case, the last thread gets the threadID equal to Nonumanodes and not 0.
That said there were a few things which were not too pretty in the last bit of code I posted, so here is a cleaner version of the last one once again. In the previous one I just increased Nonumanodes to 9 when I wanted to run 8 numanodes, in this one this is no longer the case.
I have added an image of the CPU load, to show that it does indeed distribute the load correctly. (the reason why node 7 is a bit higher is due to a process already running in the background)
program NumaAwareDGEMM use IFPORT use omp_lib use mkl_service implicit none logical(4) :: Success integer :: numaNodeCount,processorPackageCount,processorCoreCount,logicalProcessorCount integer,dimension(3) :: processorCacheCount integer :: NoNUMANodes, blocksize,nrepeats,Runmode,t0 integer :: N,I,J,NIte, First,Last,k,colidx,error,numofblocks,iii,ii,dim,d,threadID,NumaID integer :: Iter,Solver,NUMASize,m,ThreadsPrNuma integer, allocatable,dimension(:) :: GlobalThreadID real*8,allocatable,dimension(:,:) :: A, B,C,rA,rB,rC,cA,cB,cC,bA,bB,bbc real*8,allocatable,target :: bC(:,:) logical, allocatable, dimension(:) :: NumaNodeDone,MKlbusy call GetMachineArch(numaNodeCount,processorPackageCount,d,logicalProcessorCount,processorCacheCount ) NoNUMANodes=8 !How many NUMA nodes to distribute calculations over ThreadsPrNuma=4 !How many threads to use pr. numa node SELECT CASE (NoNUMANodes) !In this case we have 8 logical processors on each NUMA node. CASE(1) success=SETENVQQ("OMP_PLACES={0:8}") CASE(2) success=SETENVQQ("OMP_PLACES={0:8},{8:8}") CASE(3) success=SETENVQQ("OMP_PLACES={0:8},{8:8},{16:8}") CASE(4) success=SETENVQQ("OMP_PLACES={0:8},{8:8},{16:8},{24:8}") CASE(5) success=SETENVQQ("OMP_PLACES={0:8},{8:8},{16:8},{24:8},{32:8}") CASE(6) success=SETENVQQ("OMP_PLACES={0:8},{8:8},{16:8},{24:8},{32:8},{40:8}") CASE(7) success=SETENVQQ("OMP_PLACES={0:8},{8:8},{16:8},{24:8},{32:8},{40:8},{48:8}") CASE(8) success=SETENVQQ("OMP_PLACES={0:8},{8:8},{16:8},{24:8},{32:8},{40:8},{48:8},{56:8}") END SELECT blocksize=6000 dim=blocksize*NoNUMANodes allocate(bA(dim,dim)) allocate(bB(dim,dim)) allocate(bC(dim,dim)) allocate(bbc(blocksize,blocksize)) allocate(GlobalThreadID(NoNumanodes+1)) call KMP_SET_STACKSIZE_S(990000000) call omp_set_dynamic(0) call mkl_set_dynamic(0) call omp_set_nested(1) !intialization region call omp_set_num_threads(NoNumaNodes+1) !$OMP PARALLEL DEFAULT(SHARED) PRIVATE(i,ii) !$OMP DO SCHEDULE(STATIC) do i = 1,NoNumanodes+1 GlobalThreadID(i)=omp_get_thread_num() print *,'Thread binding for socket=',GlobalThreadID(i) SELECT CASE (GlobalThreadID(i)) CASE(1) success=SETENVQQ("OMP_PLACES={0:8}") CASE(2) success=SETENVQQ("OMP_PLACES={8:8}") CASE(3) success=SETENVQQ("OMP_PLACES={16:8}") CASE(4) success=SETENVQQ("OMP_PLACES={24:8}") CASE(5) success=SETENVQQ("OMP_PLACES={32:8}") CASE(6) success=SETENVQQ("OMP_PLACES={40:8}") CASE(7) success=SETENVQQ("OMP_PLACES={48:8}") CASE(8) success=SETENVQQ("OMP_PLACES={56:8}") END SELECT if(GlobalThreadID(i).gt.0) then call KMP_SET_BLOCKTIME(0) ii=mkl_set_num_threads_local(ThreadsPrNuma) call dgemm('N','N',blocksize,blocksize,blocksize,1.d0,bA,dim,bB,dim,0.d0,bbc,blocksize) end if end do !$OMP END DO !$OMP END PARALLEL !$OMP PARALLEL DEFAULT(SHARED) PRIVATE(i,ii,threadID) !$OMP DO SCHEDULE(STATIC) do i = 1,NoNumanodes+1 threadID=omp_get_thread_num() print *,'Thread binding for socket=',GlobalThreadID(i),threadID if((threadID.gt.0)) then call dgemm('N','N',blocksize,blocksize,blocksize,1.d0,bA,dim,bB,dim,0.d0,bbc,blocksize) end if end do !$OMP END DO !$OMP END PARALLEL end program NumaAwareDGEMM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Good screen shot proof of pudding. Now for an eye opener
time0 = omp_get_wtime() do j=1,10 !$OMP PARALLEL DEFAULT(SHARED) PRIVATE(i,ii,threadID) !$OMP DO SCHEDULE(STATIC) do i = 1,NoNumanodes+1 threadID=omp_get_thread_num() print *,'Thread binding for socket=',GlobalThreadID(i),threadID if((threadID.gt.0)) then call dgemm('N','N',blocksize,blocksize,blocksize,1.d0,bA,dim,bB,dim,0.d0,bbc,blocksize) else SLEEPQQ(0) ! make one run with this, a second run with this commented out end if end do !$OMP END DO !$OMP END PARALLEL end do print *, "time=", omp_get_wtime() - time0
Then experiment with the MKLbusy flags
time0 = omp_get_wtime() MKLbusy = .true. do j=1,10 !$OMP PARALLEL DEFAULT(SHARED) PRIVATE(i,ii,threadID) !$OMP DO SCHEDULE(STATIC) do i = 1,NoNumanodes+1 threadID=omp_get_thread_num() print *,'Thread binding for socket=',GlobalThreadID(i),threadID if((threadID.gt.0)) then call dgemm('N','N',blocksize,blocksize,blocksize,1.d0,bA,dim,bB,dim,0.d0,bbc,blocksize) MKLbusy(threadID) = .false. else MKLbusy(threadID) = .false. do while(any(MKLbusy)) SLEEPQQ(0) ! make one run with this, a second run with this commented out end do end if end do !$OMP END DO !$OMP END PARALLEL end do print *, "time=", omp_get_wtime() - time0
The MKLbusy code above relieves the SpinWait time of master thread from NUMA node 0 (default ~200ms for that thread). With the do j=1,10 loop, you could see as much as 2 seconds difference in performance.
In a formal program you may want to use NoNumaNodes number of threads in parallel regions that are not using MKL. In this case you would want to use
if((threadID.gt.0).and.(threadID.le.NoNumaNodes) then
as MKL filter.
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I tried playing around with the spintime, but didn't get much performance out of that.
I have also tested the numa-aware dgemm compared to the normal one and what I find is the following:
On a matrix with 16000 rows/columns
with a blocksize of 1000
and 8 numa nodes
with 3 threads pr numanode
I get around 200 GFlops.
This was the maximum speed I was able to get, even though each numanode should have 8 logical processors, but when I increased threads pr numanode above 3 I got slower speeds.
However the same matrix just solved in one block, distributed across the memory, I was able to get 340 GFlops, and it was able to utilize 7 threads pr numanode.
So it seems to me that the dgemm routine has some kind of numa-aware built into it already, otherwise I really don't see how such speeds are possible.
Next up I will do tests with numa-aware paradiso.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I got a rather interesting response from the MKL development team:
Control over thread affinity is customer responsibility and MKL does not change the affinity setting in any way. Another observation is that whole exercise of building NUMA-aware BLAS from MKL is pointless as MKL is aware of NUMA architecture and data locality.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Steve Lionel (Intel) wrote:
I got a rather interesting response from the MKL development team:
Control over thread affinity is customer responsibility and MKL does not change the affinity setting in any way. Another observation is that whole exercise of building NUMA-aware BLAS from MKL is pointless as MKL is aware of NUMA architecture and data locality.
Now that is very interesting! The last I heard was that MKL was NOT NUMA-aware, and when I search around on intel MKL page or through google I am still not able to find anything that says MKL is NUMA-aware. The closest I can find is this article:
which goes into the performance issue of MKL with NUMA, but also says the following:
Windows doesn't happen to provide the same sort of utility, so the performance on Windows is poor.
But that is great news if MKL is NUMA aware! Do you know whether they have any kind of documentation about how one should approach and use the MKL on NUMA systems?
On that same note... I have moved on to trying to paralize pardiso, but so far no success. I feel like this is a different issue I'm encountering here since I simple can't get pardiso to use anymore than 1 thread pr pardiso
!$OMP PARALLEL DEFAULT(SHARED) PRIVATE(i,ii) !$OMP DO SCHEDULE(STATIC) do i = 1,2 if (i.eq.1) then ii=mkl_set_num_threads_local(3) CALL pardiso (PT, maxfct, mnum, mtype, phase, sH%NoRows, sH%Vals, sH%RowIdxs, sH%ColIdxs, idum, nrhs, iparm, msglvl, myov, x1, error) else if (i.eq.2) then ii=mkl_set_num_threads_local(3) CALL pardiso (PT, maxfct, mnum, mtype, phase, sH%NoRows, sH%Vals, sH%RowIdxs, sH%ColIdxs, idum, nrhs, iparm, msglvl, myov, x2, error) end if end do !$OMP END DO !$OMP END PARALLEL
So the idea is that the outer loop will split across the NUMA nodes, and mkl_set_num_threads_local(3) will then make pardiso spawn 3 threads on each NUMA-node to run pardiso on. The problem is that right now I just get 2 spawns of pardiso, but each spawn only runs on 1 thread, thus completely disregarding the mkl_set_num_threads_local(3)...
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
On a single large matrix multiply, letting MKL figure things out is generally best. This said, if your program has multiple batches (frames), then you might find it best to pipeline these to the different nodes.
Also note the main benefit of NUMA aware programming is principally obtained by assuring the memory used resides on the NUMA node doing the processing. Although temp arrays (internal to MKL) may get allocated on a local node, the input and output arrays are generally allocated in the main thread and subsequent to allocation are "first touch" mapped to the first touching node. This requires that your first touching has to be performed on the nodes you want for use on later calls.
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Tue B. wrote:
Quote:
Steve Lionel (Intel) wrote:
I got a rather interesting response from the MKL development team:
Control over thread affinity is customer responsibility and MKL does not change the affinity setting in any way. Another observation is that whole exercise of building NUMA-aware BLAS from MKL is pointless as MKL is aware of NUMA architecture and data locality.
Now that is very interesting! The last I heard was that MKL was NOT NUMA-aware, and when I search around on intel MKL page or through google I am still not able to find anything that says MKL is NUMA-aware. The closest I can find is this article:
https://software.intel.com/en-us/articles/getting-high-performance-on-nu...
which goes into the performance issue of MKL with NUMA, but also says the following:
Windows doesn't happen to provide the same sort of utility, so the performance on Windows is poor.
But that is great news if MKL is NUMA aware! Do you know whether they have any kind of documentation about how one should approach and use the MKL on NUMA systems?
On that same note... I have moved on to trying to paralize pardiso, but so far no success. I feel like this is a different issue I'm encountering here since I simple can't get pardiso to use anymore than 1 thread pr pardiso
!$OMP PARALLEL DEFAULT(SHARED) PRIVATE(i,ii) !$OMP DO SCHEDULE(STATIC) do i = 1,2 if (i.eq.1) then ii=mkl_set_num_threads_local(3) CALL pardiso (PT, maxfct, mnum, mtype, phase, sH%NoRows, sH%Vals, sH%RowIdxs, sH%ColIdxs, idum, nrhs, iparm, msglvl, myov, x1, error) else if (i.eq.2) then ii=mkl_set_num_threads_local(3) CALL pardiso (PT, maxfct, mnum, mtype, phase, sH%NoRows, sH%Vals, sH%RowIdxs, sH%ColIdxs, idum, nrhs, iparm, msglvl, myov, x2, error) end if end do !$OMP END DO !$OMP END PARALLELSo the idea is that the outer loop will split across the NUMA nodes, and mkl_set_num_threads_local(3) will then make pardiso spawn 3 threads on each NUMA-node to run pardiso on. The problem is that right now I just get 2 spawns of pardiso, but each spawn only runs on 1 thread, thus completely disregarding the mkl_set_num_threads_local(3)...
Hi,
Nice to see mention of MKL pardiso here! Can i ask you to set msglvl parameter to 1 and send output result to me? That could verify that pardiso work on 1 thread only
Thanks,
Alex
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Steve Lionel (Intel) wrote:
I got a rather interesting response from the MKL development team:
Control over thread affinity is customer responsibility and MKL does not change the affinity setting in any way. Another observation is that whole exercise of building NUMA-aware BLAS from MKL is pointless as MKL is aware of NUMA architecture and data locality.
I showed my boss the above to which he told me that can not be. So together with him I made an example using the sparsematrix product, mkl_dcsrmultcsr, which shows a very bad numa scaling.
Given the sparsity of the matrix we are running this on, we would expect almost a linear speedup when throwing more numa-nodes after this problem, which we see is clearly not the case.
Matrix loading time 105.785191362025
Runtime 1 thread, no dist,GFLOPS: 171.182012695586
5.374396539695677E-003
Runtime 7 thread, no dist,GFLOPS: 26.0330252225976
3.533972747395896E-002
Runtime 56 thread, dist,GFLOPS: 9.05253138463013 0.101629033648127
So utilizing 8 numa nodes with 7 threads each instead of just numa node with 7 threads, gives us a speed-up of less than 3.
I have attached the matrix we used, and the program that do the calculation is the following which should run as a standalone:
program SparseMatrixProduct use IFPORT use omp_lib implicit none real*8,allocatable :: myVals(:),myVals2(:),myVals3(:) integer,allocatable :: MyRows(:),MyCols(:),MyRows2(:),MyCols2(:),MyRows3(:),MyCols3(:) integer :: info,CNoElements,NoElements,NoRows,NoCols real*8 :: RuntimeBegin,RunTimeEnd,Flops logical :: Success integer :: nrepeats integer :: N,I,J,NIte, First,Last integer,parameter :: F=99 character*256 :: S success=SETENVQQ("OMP_PLACES={0:8},{8:8},{16:8},{24:8},{32:8},{40:8},{48:8},{56:8}") !success = SETENVQQ("MKL_DYNAMIC=FALSE") !$OMP PARALLEL DEFAULT(SHARED) !$OMP END PARALLEL !Number of repeats for the matrix product Nrepeats=50 !matrix dimensions NoRows=759660 NoCols=759660 NoElements=14081052 allocate(MyVals(NoElements)) allocate(MyCols(NoElements)) allocate(MyRows(NoRows+1)) allocate(MyVals2(NoElements)) allocate(MyCols2(NoElements)) allocate(MyRows2(NoRows+1)) allocate(MyRows3(NoRows+1)) RuntimeBegin=omp_get_wtime() !Row indices open(unit=F,file='c:\temp\004_rowIdxs.txt',status='unknown') read(F,'(A)') S do i=1,NoRows+1 read(F,'(A)') S read(S,*) MyRows(i) end do close(F) !Col indices open(unit=F,file='c:\temp\004_colIdxs.txt',status='unknown') read(F,'(A)') S do i=1,NoElements read(F,'(A)') S read(S,*) MyCols(i) end do close(F) !Vals indices open(unit=F,file='c:\temp\004_values.txt',status='unknown') read(F,'(A)') S do i=1,NoElements read(F,'(A)') S read(S,*) MyVals(i) end do close(F) RuntimeEnd=omp_get_wtime() print *,'Matrix loading time ',RuntimeEnd-RuntimeBegin !Copy the matrix into the corresponding vectors for matrix 2 MyVals2(:)=MyVals(:) MyCols2(:)=MyCols(:) MyRows2(:)=MyRows(:) !Now we run the matrix product on 1 node call omp_set_num_threads(1) RuntimeBegin=omp_get_wtime() do i=1,nrepeats call mkl_dcsrmultcsr('n', 1, 7,NoRows,NoCols, NoCols, MyVals, MyCols, MyRows, MyVals2, MyCols2, MyRows2, MyVals3, MyCols3, MyRows3, NoElements, info) end do CNoElements=MyRows3(NoRows+1)-1 !Now allocate the output matrix! allocate(MyVals3(CNoElements)) allocate(MyCols3(CNoElements)) do i=1,nrepeats call mkl_dcsrmultcsr('n', 2, 7,NoRows,NoCols, NoCols, MyVals, MyCols, MyRows, MyVals2, MyCols2, MyRows2, MyVals3, MyCols3, MyRows3, NoElements, info) end do RuntimeEnd=omp_get_wtime() print *,'Runtime 1 thread, no dist,GFLOPS:',RuntimeEnd-RuntimeBegin,0.92/(RuntimeEnd-RuntimeBegin) !We know that the matrix product takes around 920M Flops. if(allocated(MyVals3)) deallocate(MyVals3) if(allocated(MyCols3)) deallocate(MyCols3) !We want to limit the affinity to one Numa-node now. success=SETENVQQ("OMP_PLACES={0:8}") call omp_set_num_threads(7) !$OMP PARALLEL DEFAULT(SHARED) !$OMP END PARALLEL RuntimeBegin=omp_get_wtime() do i=1,nrepeats call mkl_dcsrmultcsr('n', 1, 7,NoRows,NoCols, NoCols, MyVals, MyCols, MyRows, MyVals2, MyCols2, MyRows2, MyVals3, MyCols3, MyRows3, NoElements, info) end do CNoElements=MyRows3(NoRows+1)-1 !Now allocate the output matrix! allocate(MyVals3(CNoElements)) allocate(MyCols3(CNoElements)) do i=1,nrepeats call mkl_dcsrmultcsr('n', 2, 7,NoRows,NoCols, NoCols, MyVals, MyCols, MyRows, MyVals2, MyCols2, MyRows2, MyVals3, MyCols3, MyRows3, NoElements, info) end do RuntimeEnd=omp_get_wtime() print *,'Runtime 7 thread, no dist,GFLOPS:',RuntimeEnd-RuntimeBegin,0.92/(RuntimeEnd-RuntimeBegin) !We now want to spread the matrix across the memory, in order to gain optimal performance on the numa system !We reallocate the first matrix, in order to utilize first touch principle if(allocated(MyVals)) deallocate(MyVals) if(allocated(MyCols)) deallocate(MyCols) if(allocated(Myrows)) deallocate(Myrows) allocate(MyVals(NoElements)) allocate(MyCols(NoElements)) allocate(MyRows(NoRows+1)) !Now we distribute the matrices before. success=SETENVQQ("OMP_PLACES={0:8},{8:8},{16:8},{24:8},{32:8},{40:8},{48:8},{56:8}") call omp_set_num_threads(8) !$OMP PARALLEL DEFAULT(SHARED) PRIVATE(i,First,Last) !$OMP DO SCHEDULE(STATIC) do i=1,NoRows First=MyRows2(i) Last=MyRows2(i+1)-1 MyVals(First:Last)=MyVals2(First:Last) MyCols(First:Last)=MyCols2(First:Last) MyRows(i)=MyRows2(i) end do !$OMP END DO !$OMP END PARALLEL !Now we do the same for Matrix 2 if(allocated(MyVals2)) deallocate(MyVals2) if(allocated(MyCols2)) deallocate(MyCols2) if(allocated(Myrows2)) deallocate(Myrows2) allocate(MyVals2(NoElements)) allocate(MyCols2(NoElements)) allocate(MyRows2(NoRows+1)) !Now we distribute the matrices before. !$OMP PARALLEL DEFAULT(SHARED) PRIVATE(i,First,Last) !$OMP DO SCHEDULE(STATIC) do i=1,NoRows First=MyRows(i) Last=MyRows(i+1)-1 MyVals2(First:Last)=MyVals(First:Last) MyCols2(First:Last)=MyCols(First:Last) MyRows2(i)=MyRows(i) end do !$OMP END DO !$OMP END PARALLEL if(allocated(MyRows3)) deallocate(MyRows3) if(allocated(MyVals3)) deallocate(MyVals3) if(allocated(MyCols3)) deallocate(MyCols3) allocate(MyRows3(NoRows+1)) call omp_set_num_threads(56) RuntimeBegin=omp_get_wtime() do i=1,nrepeats call mkl_dcsrmultcsr('n', 1, 7,NoRows,NoCols, NoCols, MyVals, MyCols, MyRows, MyVals2, MyCols2, MyRows2, MyVals3, MyCols3, MyRows3, NoElements, info) end do CNoElements=MyRows3(NoRows+1)-1 !Now allocate the output matrix! allocate(MyVals3(CNoElements)) allocate(MyCols3(CNoElements)) do i=1,nrepeats call mkl_dcsrmultcsr('n', 2, 7,NoRows,NoCols, NoCols, MyVals, MyCols, MyRows, MyVals2, MyCols2, MyRows2, MyVals3, MyCols3, MyRows3, NoElements, info) end do RuntimeEnd=omp_get_wtime() print *,'Runtime 56 thread, dist,GFLOPS:',RuntimeEnd-RuntimeBegin,0.92/(RuntimeEnd-RuntimeBegin) end program SparseMatrixProduct
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I have now also made a test of pardiso using the same parallel setup. But for pardiso the above clearly does not work, not only does the affinity not get respected - threads gets placed across all numa nodes even when I try to restrict them to only a few. But even worse they do not seem to respect the number of mkl threads I restrict them to. The code below should use less than 25% of my cpu power (I have 64 logic cpus available), yet I consistently see the pardiso solver use upwards to 70% of my cpus.
Can anyone confirm this behavior, and perhaps tell me what I do wrong?
Below is the pardiso output. I call pardiso twice, and it is worth noticing the huge amount of time it spends allocating internal structures the first time around - what exactly is this and why does it take this long?
Another thing worth mentioning is that for me the reordering takes 85% of the total time, which seems like a lot.
Percentage of computed non-zeros for LL^T factorization 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 === PARDISO: solving a real nonsymmetric system === 1-based array indexing is turned ON PARDISO double precision computation is turned ON METIS algorithm at reorder step is turned ON Scaling is turned ON Matching is turned ON Single-level factorization algorithm is turned ON Summary: ( starting phase is reordering, ending phase is solution ) ================ Times: ====== Time spent in calculations of symmetric matrix portrait (fulladj): 0.632378 s Time spent in reordering of the initial matrix (reorder) : 369.483572 s Time spent in symbolic factorization (symbfct) : 1.168712 s Time spent in data preparations for factorization (parlist) : 0.139124 s Time spent in copying matrix to internal data structure (A to LU): 0.000000 s Time spent in factorization step (numfct) : 66.482128 s Time spent in direct solver at solve step (solve) : 0.513812 s Time spent in allocation of internal data structures (malloc) : 642.525975 s Time spent in additional calculations : 1.243984 s Total time spent : 1082.189686 s Statistics: =========== Parallel Direct Factorization is running on 4 OpenMP < Linear system Ax = b > number of equations: 759660 number of non-zeros in A: 14081052 number of non-zeros in A (): 0.002440 number of right-hand sides: 1 < Factors L and U > number of columns for each panel: 96 number of independent subgraphs: 0 < Preprocessing with state of the art partitioning metis> number of supernodes: 543609 size of largest supernode: 1590 number of non-zeros in L: 165815464 number of non-zeros in U: 152153734 number of non-zeros in L+U: 317969198 gflop for the numerical factorization: 255.499593 gflop/s for the numerical factorization: 3.843132 Percentage of computed non-zeros for LL^T factorization 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 === PARDISO: solving a real nonsymmetric system === 1-based array indexing is turned ON PARDISO double precision computation is turned ON METIS algorithm at reorder step is turned ON Scaling is turned ON Matching is turned ON Single-level factorization algorithm is turned ON Summary: ( starting phase is reordering, ending phase is solution ) ================ Times: ====== Time spent in calculations of symmetric matrix portrait (fulladj): 0.470824 s Time spent in reordering of the initial matrix (reorder) : 372.948610 s Time spent in symbolic factorization (symbfct) : 1.373649 s Time spent in data preparations for factorization (parlist) : 0.137292 s Time spent in copying matrix to internal data structure (A to LU): 0.000000 s Time spent in factorization step (numfct) : 55.114394 s Time spent in direct solver at solve step (solve) : 0.442787 s Time spent in allocation of internal data structures (malloc) : 1.351361 s Time spent in additional calculations : 1.178430 s Total time spent : 433.017346 s Statistics: =========== Parallel Direct Factorization is running on 4 OpenMP < Linear system Ax = b > number of equations: 759660 number of non-zeros in A: 14081052 number of non-zeros in A (): 0.002440 number of right-hand sides: 1 < Factors L and U > number of columns for each panel: 96 number of independent subgraphs: 0 < Preprocessing with state of the art partitioning metis> number of supernodes: 543609 size of largest supernode: 1590 number of non-zeros in L: 165815464 number of non-zeros in U: 152153734 number of non-zeros in L+U: 317969198 gflop for the numerical factorization: 255.499593 gflop/s for the numerical factorization: 4.635805 Pardiso time 876.761621378828
Below you will find the code I used to generate the above.
(The attached matrix needs to be copied to c:\temp\ or a change in the input directory needs to be made)
PROGRAM PardisoTest use mkl_service use MKL_PARDISO use IFPORT use omp_lib implicit none logical(4) :: Success integer :: numaNodeCount,processorPackageCount,processorCoreCount,logicalProcessorCount integer,dimension(3) :: processorCacheCount integer :: NoNUMANodes, blocksize,nrepeats,Runmode,t0 integer :: N,I,J,NIte, First,Last,k,colidx,error,numofblocks,iii,ii,dim,d,threadID,NumaID integer :: Iter,Solver,NUMASize,m,ThreadsPrNuma,noRows integer, allocatable,dimension(:) :: GlobalThreadID real*8,allocatable,dimension(:,:) :: A, B,C,rA,rB,rC,cA,cB,cC,bA,bB,bbc real*8,allocatable,target :: bC(:,:) logical, allocatable, dimension(:) :: NumaNodeDone,MKlbusy real*8 :: Runtimeend, runtimebegin,Flops,Runtime real*8,pointer,dimension(:) ::vals,rhs,x,MyVals,MyRHS integer,pointer,dimension(:) ::row,col, myRows,myCols integer :: numofblocksprnuma, last_row,last_col,first_row,first_col,NumOfBlocksPrRow,MyNoElements,MyNoRows integer,parameter :: F=99 character*256 :: S TYPE(MKL_PARDISO_HANDLE), pointer :: PT(:) => null() integer, pointer :: idum(:) => null() real(KIND=8), pointer :: ddum(:) => null() INTEGER maxfct, mnum, mtype, phase, nrhs, msglvl INTEGER iparm(64) real*8,pointer,dimension(:) :: myoV real*8 :: error1 call GetMachineArch(numaNodeCount,processorPackageCount,d,logicalProcessorCount,processorCacheCount ) NoNUMANodes=3 !How many NUMA nodes to distribute calculations over ThreadsPrNuma=4 !How many threads to use pr. numa node NRepeats = 2 SELECT CASE (NoNUMANodes) !In this case we have 8 logical processors on each NUMA node. CASE(1) success=SETENVQQ("OMP_PLACES={0:8}") CASE(2) success=SETENVQQ("OMP_PLACES={0:8},{8:8}") CASE(3) success=SETENVQQ("OMP_PLACES={0:8},{8:8},{16:8}") CASE(4) success=SETENVQQ("OMP_PLACES={0:8},{8:8},{16:8},{24:8}") CASE(5) success=SETENVQQ("OMP_PLACES={0:8},{8:8},{16:8},{24:8},{32:8}") CASE(6) success=SETENVQQ("OMP_PLACES={0:8},{8:8},{16:8},{24:8},{32:8},{40:8}") CASE(7) success=SETENVQQ("OMP_PLACES={0:8},{8:8},{16:8},{24:8},{32:8},{40:8},{48:8}") CASE(8) success=SETENVQQ("OMP_PLACES={0:8},{8:8},{16:8},{24:8},{32:8},{40:8},{48:8},{56:8}") END SELECT allocate(GlobalThreadID(NoNumanodes+1)) call KMP_SET_STACKSIZE_S(990000000) call omp_set_dynamic(0) call mkl_set_dynamic(0) call omp_set_nested(1) allocate(A(200,200)) allocate(B(200,200)) allocate(C(200,200)) A=0 B=1 C=0 !intialization region call omp_set_num_threads(NoNumaNodes+1) !$OMP PARALLEL DEFAULT(SHARED) PRIVATE(i,ii,j) !$OMP DO SCHEDULE(STATIC) do i = 1,NoNumanodes+1 GlobalThreadID(i)=omp_get_thread_num() print *,'Thread binding for socket=',GlobalThreadID(i) SELECT CASE (GlobalThreadID(i)) CASE(1) success=SETENVQQ("OMP_PLACES={0:8}") CASE(2) success=SETENVQQ("OMP_PLACES={8:8}") CASE(3) success=SETENVQQ("OMP_PLACES={16:8}") CASE(4) success=SETENVQQ("OMP_PLACES={24:8}") CASE(5) success=SETENVQQ("OMP_PLACES={32:8}") CASE(6) success=SETENVQQ("OMP_PLACES={40:8}") CASE(7) success=SETENVQQ("OMP_PLACES={48:8}") CASE(8) success=SETENVQQ("OMP_PLACES={56:8}") END SELECT if(GlobalThreadID(i).gt.0) then !Spawn the MKL_threadpool ii=mkl_set_num_threads_local(ThreadsPrNuma) call dgemm('N','N',2,2,2,1.d0,A,2,B,2,0.d0,C,2) end if end do !$OMP END DO !$OMP END PARALLEL !Now we load the matrix we want to use pardiso on RuntimeBegin = omp_get_wtime() !difinition file open(unit=F,file='c:\temp\001_definition.txt',status='unknown') read(F,'(A)') S read(F,'(A)') S read(S,*) MyNoRows, MyNoRows, MyNoElements, i close(F) allocate(MyRows(MyNoRows+1)) allocate(MyCols(MyNoElements)) allocate(MyVals(MyNoElements)) allocate(MyRHS(MyNoRows)) print*,'NoRows',MyNoRows print*,'NoElements',MyNoElements !Row indices open(unit=F,file='c:\temp\001_rowIdxs.txt',status='unknown') read(F,'(A)') S do i=1,MyNoRows+1 read(F,'(A)') S read(S,*) MyRows(i) end do close(F) !Col indices open(unit=F,file='c:\temp\001_colIdxs.txt',status='unknown') read(F,'(A)') S do i=1,MyNoElements read(F,'(A)') S read(S,*) MyCols(i) end do close(F) !Vals indices open(unit=F,file='c:\temp\001_values.txt',status='unknown') read(F,'(A)') S do i=1,MyNoElements read(F,'(A)') S read(S,*) MyVals(i) end do close(F) !RHS indices open(unit=F,file='c:\temp\001_RHS.txt',status='unknown') read(F,'(A)') S do i=1,MyNoRows read(F,'(A)') S read(S,*) MyRHS(i) end do close(F) RuntimeEnd=omp_get_wtime() print *,'Matrix loading time ',RuntimeEnd-RuntimeBegin Print*,'Now the pardiso part' call omp_set_num_threads(NoNumaNodes+1) RuntimeBegin = omp_get_wtime() !$OMP PARALLEL DEFAULT(SHARED) PRIVATE(i,ii,threadID,j,row,col,vals,rhs,x,PT,maxfct,mnum,mtype,phase,NoRows,idum,nrhs,iparm,msglvl,error) !$OMP DO SCHEDULE(STATIC) do i = 1,NoNumanodes+1 threadID=omp_get_thread_num() if((threadID.gt.0)) then allocate(row(MyNoRows+1)) allocate(col(MyNoElements)) allocate(vals(MyNoElements)) allocate(rhs(MyNoRows)) allocate(x(MyNoRows)) allocate( PT(64), idum(0)) row=MyRows col=MyCols vals=MyVals NoRows=MyNoRows rhs=MyRHS nrhs= 1 maxfct =1 mnum =1 iparm = 0 error = 0 if((threadID.eq.1)) then msglvl = 1 else msglvl = 0 end if mtype = 11 do j = 1, 64 PT(j)%DUMMY = 0 end do phase = 13 do j=1,Nrepeats CALL pardiso (PT, maxfct, mnum, mtype, phase, NoRows, Vals,Row,Col, idum, nrhs, iparm, msglvl, rhs,x, error) end do end if end do !$OMP END DO !$OMP END PARALLEL RuntimeEnd = omp_get_wtime() print*,'Pardiso time',RuntimeEnd-RunTimebegin END PROGRAM PardisoTest subroutine GetMachineArch(numaNodeCount,processorPackageCount,processorCoreCount,logicalProcessorCount,processorCacheCount ) use, intrinsic :: ISO_C_BINDING use kernel32 implicit none integer,intent(out) :: numaNodeCount,processorPackageCount,processorCoreCount,logicalProcessorCount integer,dimension(3),intent(out) :: processorCacheCount ! Variables procedure(GetLogicalProcessorInformation), pointer :: glpi type(T_SYSTEM_LOGICAL_PROCESSOR_INFORMATION), allocatable, dimension(:) :: buffer integer(DWORD) :: returnLength = 0 integer(DWORD) :: ret integer :: nlpi, lpi_element_length, i processorCacheCount=0 numaNodeCount = 0 processorCoreCount = 0 logicalProcessorCount = 0 processorPackageCount = 0 call c_f_procpointer( & transfer( & GetProcAddress( & GetModuleHandle("kernel32"//C_NULL_CHAR), & "GetLogicalProcessorInformation"//C_NULL_CHAR & ), & C_NULL_FUNPTR & ), & glpi) if (.not. associated(glpi)) then print *,"GetLogicalProcessorInformation not supported" error stop end if ! We don't know in advance the size of the buffer we need. We'll pick a number, allocate it, ! and see if that's sufficient. If not, we'll use the returned size information and reallocate ! the buffer to the required size. allocate (buffer(20)) lpi_element_length = C_SIZEOF(buffer(1)) returnLength = C_SIZEOF(buffer) ret = glpi(buffer, returnLength) if (ret == FALSE) then ! Failed if (GetLastError() == ERROR_INSUFFICIENT_BUFFER) then deallocate (buffer) allocate (buffer(returnLength/lpi_element_length)) ret = glpi(buffer, returnLength) if (ret == FALSE) then print *,"GetLogicalProcessorInformation call failed with error code ", GetLastError() error stop end if else print *, "GetLogicalProcessorInformation call failed with error code ", GetLastError() error stop end if end if ! Now we can iterate through the elements of buffer and see what we can see do i=1, returnLength / lpi_element_length ! Number of elements in buffer select case (buffer(i)%Relationship) case(RelationNumaNode) ! NUMA nodes return one record of this type numaNodeCount = numaNodeCount + 1 case(RelationProcessorCore) processorCoreCount = processorCoreCount + 1 ! A Hyperthreaded core supplies more than one logical processor logicalProcessorCount = logicalProcessorCount + popcnt(buffer(i)%processorMask) case(RelationCache) ! One cache descriptor for each cache if (buffer(i)%Cache%Level > 0 .and. buffer(i)%Cache%Level <= 3) then processorCacheCount(buffer(i)%Cache%Level) = processorCacheCount(buffer(i)%Cache%Level) + 1 else print *,"Invalid processor cache level ", buffer(i)%Cache%Level end if case(RelationProcessorPackage) !Logical processors share a physical package (socket) processorPackageCount = processorPackageCount + 1 case default print *, "Unrecognized relationship code ", buffer(i)%Relationship end select end do if(allocated(buffer)) deallocate(buffer) print*,"GetLogicalProcessorInformation results:" print*, " Number of NUMA nodes: ", numaNodeCount print*, " Number of physical processor packages: ", processorPackageCount print*, " Number of processor cores: ", processorCoreCount print*, " Number of logical processors: ", logicalProcessorCount print*, " Number of processor L1/L2/L3 caches: ",processorCacheCount end subroutine GetMachineArch
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hello,
Hope not make the saturation bad. We will investigate the reordering time issue and get back later. (seems it is running with 1 thread during the phase, so slow).
I add the message regarding the Affinity problem & " How to set affinity of threads spawned by MKL?"
We had did some investigation about this and general message is in the PDF file of https://software.intel.com/en-us/articles/using-threaded-intel-mkl-in-multi-thread-application :
and some discussions also in
https://software.intel.com/en-us/articles/recommended-settings-for-calling-intelr-mkl-routines-from-multi-threaded-applications
https://software.intel.com/en-us/articles/intel-math-kernel-library-intel-mkl-using-intel-mkl-with-threaded-applications
In that sample, we use
pthread as external level thread ( in your case, your use OpenMP as external level thread, it may be mess the external thread and internal thread)
and mkl as internal level thread.
basically, MKL internal level thread doesn’t follow up the external level thread affinity. so you can't control the affinity of MKL internal level thread by set the affinity of external level thread. So in your last example, mkl pardiso thread don't respect your Numa Node control as your did it for external threads.
Though OpenMP thread and pthread don’t know each other, we still can control the OpenMP thread’s affinity by inserting set affinity function at the point where MKL thread spawn. So it is still possible to control MKL threading in your case, for example
1) don't control the affinity and just set MKL thread number. Let's OS to control. what the CPU usage look like?
2) Control the MKL thread affinity carefully (not control the extern level omp threads). for example, as jimdempseyatthecove suggested.
add affinity control after mkl_set_num_threads_local(). for example, move the ii=mkl_set_num_threads_local(3) before the affinity control
allocate(GlobalThreadID(NoNumanodes+1))
call KMP_SET_STACKSIZE_S(990000000)
call omp_set_dynamic(0)
call mkl_set_dynamic(0)
call omp_set_nested(1)
call omp_set_max_active_levels(2)
call omp_set_num_threads(NoNumaNodes+1)
!$OMP PARALLEL DEFAULT(SHARED) PRIVATE(i,ii,j)
!$OMP DO SCHEDULE(STATIC)
do i = 1,NoNumanodes+1
ii=mkl_set_num_threads_local(3)
/* control affinity as the Article.
The below control is for external OpenMP. please remove or change for internal MKL threads
GlobalThreadID(i)=omp_get_thread_num()
print *,'Thread binding for socket=',GlobalThreadID(i)
SELECT CASE (GlobalThreadID(i))
CASE(1)
success=SETENVQQ("OMP_PLACES={0:8}") */
do j = 1, 64
PT(j)%DUMMY = 0
end do
phase = 13
do j=1,Nrepeats
CALL pardiso (PT, maxfct, mnum, mtype, phase, NoRows, Vals,Row,Col, idum, nrhs, iparm, msglvl, rhs,x, error)
end do
Best Regards,
Ying
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hey Ying
Thank you for the comment!
Especially the part about putting the ii=mkl_set_num_threads_local before the affinity settings, that seemed to do the trick. Now it is just a matter of the reordering and initiallization as previously mentioned.
Cheers
Tue
- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page
- « Previous
-
- 1
- 2
- Next »