Hello
I have been trying to optimize matrix multiplication on NUMA systems but so far without much luck.
I have played around with the dgemm routine and first touch.
A snippet of my code looks like this:
print*, 'initiating first touch'
A=0
B=0
C=0
!$OMP PARALLEL DEFAULT(SHARED) PRIVATE(i)
!$OMP DO SCHEDULE(STATIC)
do i=1,dim
   rA(:,i)=0.d0
   rB(:,i)=0.d0
   rC(:,i)=0.d0
   cA(i,:)=0.d0
   cB(i,:)=0.d0
   cC(i,:)=0.d0
end do
!$OMP END DO
!$OMP END PARALLEL
print*, 'first touch done'

Runtimebegin=omp_get_wtime()
do i=1,10
   call dgemm('N','N',dim,dim,dim,1.d0,A,dim,B,dim,0.d0,C,dim)
end do
Runtimeend=omp_get_wtime()
print*, 'ABC', runtimeend-runtimebegin

call amat(rA,y,z,n,m,mm)

Runtimebegin=omp_get_wtime()
do i=1,10
   call dgemm('N','N',dim,dim,dim,1.d0,cA,dim,cB,dim,0.d0,cC,dim)
end do
Runtimeend=omp_get_wtime()
print*, 'cAcBcC', runtimeend-runtimebegin

Runtimebegin=omp_get_wtime()
do i=1,10
   call dgemm('N','N',dim,dim,dim,1.d0,rA,dim,rB,dim,0.d0,rC,dim)
end do
Runtimeend=omp_get_wtime()
print*, 'rArBrC', runtimeend-runtimebegin

Runtimebegin=omp_get_wtime()
do i=1,10
   call dgemm('N','N',dim,dim,dim,1.d0,cA,dim,rB,dim,0.d0,cC,dim)
end do
Runtimeend=omp_get_wtime()
print*, 'cArBcC', runtimeend-runtimebegin
From this I get the following results on a NUMA architecture:
ABC 37.0858174243185
cAcBcC 9.20657384615333
rArBrC 9.42917347316688
cArBcC 9.22702269622823
As you can see, it is important to spread the data across the NUMA system, but exactly how it is spread seems to have little influence; at least I haven't found an optimal way to do it.
So I ask, is there an optimal way to spread my matrices when using dgemm?
Or, if dgemm is not suitable on NUMA, are there any alternatives that handle matrix multiplication faster on NUMA architectures?
Thanks in advance
Tue
Your stride-1 array initialization ought to engage the first-touch locality optimization if you set appropriate affinity via OMP_PROC_BIND or KMP_AFFINITY; I doubt it can be done for a large stride.
MKL's dgemm is written to minimize memory references, so it may not depend strongly on locality. It has been demonstrated to be efficient on dual-CPU systems.
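To give an idea of setting that affinity from within the program, here is a sketch only: it assumes the Intel Fortran SETENVQQ extension and the Intel OpenMP runtime, and the values are illustrative. The point is that affinity must be in place before the first parallel region, so the first-touch loop runs with pinned threads.

   ! Sketch: set affinity before the OpenMP runtime starts, i.e. before the
   ! first parallel region; the values below are illustrative, not prescriptive.
   use ifport, only: setenvqq
   logical :: ok
   ok = SETENVQQ("OMP_PROC_BIND=spread")   ! portable OpenMP 4.0 form
   ! Intel-specific alternative:
   ! ok = SETENVQQ("KMP_AFFINITY=granularity=fine,scatter")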
Hello,
Have you tried numactl? You can specify interleaved memory mapping with the numactl command.
$> numactl -i all <your binary>
Thanks,
Kazushige Goto
Hey
Thanks for the input, Tim. So far my tests have also shown that dgemm isn't strongly dependent on locality.
numactl sounds like an easy way to control memory mapping; unfortunately I'm on Windows, and as far as I can see there is nothing similar there.
Does anyone know whether it is possible to specify the number of threads dgemm should use?
If I do something like the code below on my 64-core machine, I notice that dgemm still only uses 32 of the cores.
call omp_set_num_threads(64)
Runtimebegin=omp_get_wtime()
do i=1,100
   call dgemm('N','N',dim,dim,dim,1.d0,cA,dim,cB,dim,0.d0,cC,dim)
end do
Runtimeend=omp_get_wtime()
Even worse, if I do something like the code below, with NUMA=8 and ThreadsPrNUMA=7, I see that only 8 threads are used in total instead of the 56 I would expect.
call omp_set_num_threads(NUMA)
!$OMP PARALLEL DEFAULT(SHARED) PRIVATE(i,first,last) REDUCTION(+:bbc)
!$OMP DO SCHEDULE(STATIC)
do i=1,NUMA
   first=(i-1)*NUMAsize+1
   call omp_set_num_threads(ThreadsPrNUMA)
   call dgemm('N','N',blocksize,blocksize,NUMAsize,1.d0,bA(1,first),dim,bB(first,1),dim,0.d0,bbC,blocksize)
end do
!$OMP END DO
!$OMP END PARALLEL
Apologies if I double-post; I was logged out while posting.
Up to my retirement, only certain OEM versions of Windows supported as many as 64 threads in shared memory; a normal upgrade would lose that capability.
You might get some use from MKL_DYNAMIC and MKL_NUM_THREADS, and maybe OMP_NESTED.
Your inner omp_set_num_threads would take effect in the next parallel region if you didn't override it.
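Something along these lines, if you use the MKL service routines rather than the environment variables (a sketch only; the thread count is a placeholder for your 64-core box):

   ! Sketch: keep MKL from trimming the thread count on its own, then say
   ! explicitly how many threads dgemm may use.
   call mkl_set_dynamic(0)          ! disable MKL's dynamic adjustment
   call mkl_set_num_threads(64)     ! upper bound for MKL internal threading
   call omp_set_nested(.true.)      ! only needed if dgemm is called from
                                    ! inside an enclosing parallel region
   call dgemm('N','N',dim,dim,dim,1.d0,cA,dim,cB,dim,0.d0,cC,dim)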
I think the problem is that dgemm automatically detects whether it is already in a parallel region, and if it is, it refuses to use additional threads, even when suggested.
The following is the basic idea for my matrix multiplication, but the major problem at the moment is that each dgemm runs on only 1 thread, even though I have just called omp_set_num_threads.
if((numaNodeCount==8).and.(logicalProcessorCount==64)) then !Beck server
   success = SETENVQQ("OMP_PLACES={0,1,2,3,4,5,6,7},{8,9,10,11,12,13,14,15},{16,17,18,19,20,21,22,23},{24,25,26,27,28,29,30,31},{32,33,34,35,36,37,38,39},{40,41,42,43,44,45,46,47},{48,49,50,51,52,53,54,55},{56,57,58,59,60,61,62,63}")
end if
call KMP_SET_STACKSIZE_S(990000000)
blocksize=3000
NUMA=8
ThreadsPrNuma=6
dim=blocksize*Numa
NumOfBlocks=dim/blocksize
NUMASize=dim/NUMA
allocate(bA(dim,dim))
allocate(bB(dim,dim))
allocate(bC(dim,dim))
allocate(bbc(blocksize,blocksize))
call omp_set_num_threads(NUMA)
!$OMP PARALLEL DEFAULT(SHARED) PRIVATE(i,first,last)
!$OMP DO SCHEDULE(STATIC)
do i=1,NUMA
   first=(i-1)*NUMA+1
   last=i*NUMA
   bA(:,first:last)=0.d0
   bB(first:last,:)=0.d0
   bC(:,first:last)=0.d0
end do
!$OMP END DO
!$OMP END PARALLEL
call omp_set_dynamic(1)
call omp_set_nested(1)
print*,'Nested?',omp_get_nested()
print*,'Dynamic?',omp_get_dynamic()
print*, 'Threads?',omp_get_max_threads()
!Setup complete, now we do the matrix multiplication
Runtimebegin=omp_get_wtime()
call omp_set_num_threads(NUMA)
do i=1,NUMofBlocks
   do j=1,NUMofBlocks
      first_row=(i-1)*blocksize+1
      first_col=(j-1)*blocksize+1
      last_row=i*blocksize
      last_col=j*blocksize
      bbc = bc(first_row:last_row,first_col:last_col)
!$OMP PARALLEL DEFAULT(SHARED) PRIVATE(k,first,last) proc_bind(spread) REDUCTION(+:bbc)
!$OMP DO SCHEDULE(STATIC)
      do k=1,NUMA
         first=(k-1)*NUMAsize+1
         last=k*NUMAsize
         call omp_set_num_threads(ThreadsPrNuma)
!$OMP PARALLEL DEFAULT(SHARED) PRIVATE(i) proc_bind(master)
!$OMP END PARALLEL
         call dgemm('N','N',blocksize,blocksize,NUMAsize,1.d0,bA(first_row:last_row,first:last),dim,bB(first:last,first_col:last_col),dim,0.d0,bbc,blocksize)
      end do
!$OMP END DO
!$OMP END PARALLEL
      bc(first_row:last_row,first_col:last_col)=bbc
   end do
end do
Runtimeend=omp_get_wtime()
print*, 'Matrix multiplication time', runtimeend-runtimebegin
I, perhaps naively, inserted:
call omp_set_num_threads(ThreadsPrNuma)
!$OMP PARALLEL DEFAULT(SHARED) PRIVATE(i) proc_bind(master)
!$OMP END PARALLEL
just before the dgemm call in the hope that it would use the number of threads specified and the affinity defined by proc_bind, but unfortunately to no effect.
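One direction I have not tried yet is the following sketch, based on my reading of the MKL documentation: MKL runs on one thread inside an active parallel region unless MKL_DYNAMIC is disabled, and mkl_set_num_threads_local sets a per-thread count (returning the previous setting). Note that for illustration I split the work over column blocks of B and C here, rather than over the inner dimension as in my code above, so that each thread writes a disjoint block of C and no reduction on bbc is needed.

   ! Sketch only: needs USE MKL_SERVICE for the MKL routine interfaces.
   call mkl_set_dynamic(0)           ! let MKL thread inside a parallel region
   call omp_set_nested(.true.)       ! allow the nested level
   call omp_set_num_threads(NUMA)    ! one outer thread per NUMA node
!$OMP PARALLEL DEFAULT(SHARED) PRIVATE(k,first,ithreads_prev) proc_bind(spread)
   ! per-thread MKL thread count; the return value is the previous setting
   ithreads_prev = mkl_set_num_threads_local(ThreadsPrNuma)
!$OMP DO SCHEDULE(STATIC)
   do k=1,NUMA
      first=(k-1)*NUMAsize+1
      ! full A times one column block of B writes a disjoint column block
      ! of C, so no reduction is needed
      call dgemm('N','N',dim,NUMAsize,dim,1.d0,bA,dim,bB(1,first),dim,0.d0,bC(1,first),dim)
   end do
!$OMP END DO
!$OMP END PARALLEL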
Is there any way to do this?
