topic optimizing dgemm on NUMA systems in Intel® oneAPI Math Kernel Library

optimizing dgemm on NUMA systems

Tue_B_ — Wed, 04 Mar 2015 14:10:48 GMT

Hello

I have been trying to optimize matrix multiplication on NUMA systems but so far without much luck.

I have played around with the dgemm routine and first touch.

A snippet of my code looks like this:

     print*, 'initiating first touch'
    A=0
    B=0
    C=0
    !$OMP PARALLEL DEFAULT(SHARED) PRIVATE(i)
    !$OMP DO SCHEDULE(STATIC)
    do i=1,dim
        rA(:,i)=0.d0
        rB(:,i)=0.d0
        rC(:,i)=0.d0
        cA(i,:)=0.d0
        cB(i,:)=0.d0
        cC(i,:)=0.d0
    end do
    !$OMP END DO
    !$OMP END PARALLEL
    
    print*, 'first touch done'
      
    Runtimebegin=omp_get_wtime()
    do i=1,10
    call dgemm('N','N',dim,dim,dim,1.d0,A,dim,B,dim,0.d0,C,dim)
    end do
    Runtimeend=omp_get_wtime()

    print*, 'ABC', runtimeend-runtimebegin

    call amat(rA,y,z,n,m,mm) 
    
    Runtimebegin=omp_get_wtime()
    do i=1,10
    call dgemm('N','N',dim,dim,dim,1.d0,cA,dim,cB,dim,0.d0,cC,dim)
    end do
    Runtimeend=omp_get_wtime()

    print*, 'cAcBcC', runtimeend-runtimebegin
    
    Runtimebegin=omp_get_wtime()
    do i=1,10
    call dgemm('N','N',dim,dim,dim,1.d0,rA,dim,rB,dim,0.d0,rC,dim)
    end do
    Runtimeend=omp_get_wtime()

    print*, 'rArBrC', runtimeend-runtimebegin
    
    
    Runtimebegin=omp_get_wtime()
    do i=1,10
    call dgemm('N','N',dim,dim,dim,1.d0,cA,dim,rB,dim,0.d0,cC,dim)
    end do
    Runtimeend=omp_get_wtime()

    print*, 'cArBcC', runtimeend-runtimebegin

From which I get the following results on NUMA architechture.

ABC 37.0858174243185
cAcBcC 9.20657384615333
rArBrC 9.42917347316688
cArBcC 9.22702269622823

As we can see, it is important to spread your data across the NUMA system, but how it is spread seems to have little influence, or at least I haven't found an optimal way to spread it.

So I ask, is there an optimal way to spread my matrices when using dgemm?

Or if dgemm is not suitable on NUMA, are there any alternatives which handle matrix multiplication faster on NUMA architechture?

Thanks in advance

Tue

Your stride 1 array

TimP — Wed, 04 Mar 2015 15:39:25 GMT

Your stride 1 array initialization ought to engage first touch locality optimization if you would set appropriate affinity by omp_proc_bind or kMP_affinity. I doubt it can be done for large stride.

mkL dgemm is written so as to minimize memory references so it may not depend strongly on locality. It has been demonstrated efficient on dual CPU.

Hello,

Kazushige_G_Intel — Fri, 06 Mar 2015 19:19:12 GMT

Hello,

Have you tried numactl? You can specify interleaved memory mapping with numactl command.

$> numactl -i all <your binary>

Thanks,

Kazushige Goto

Hey

Tue_B_ — Tue, 10 Mar 2015 10:42:32 GMT

Hey

Thanks for the input Tim, so far my tests have also shown that dgemm isn't dependent on locality.

numactl sounds like an easy way to control memory mapping, unfortunately I'm on windows and as far as I can see there isn't anything similar on there.

Does anyone know whether it is possible to specify the number of threads dgemm should use?

If I do something like the one below on my 64 core machine, I notice that dgemm still only uses 32 of the cores.

    call omp_set_num_threads(64)    
    Runtimebegin=omp_get_wtime()
    do i=1,100
    call dgemm('N','N',dim,dim,dim,1.d0,cA,dim,cB,dim,0.d0,cC,dim)
    end do
    Runtimeend=omp_get_wtime()

And even worse is if I do something like the one below here, where NUMA=8 and ThreadsPrNUMA=7, then I see it only uses 8 threads in total, instead of the 56 I would expect.

  call omp_set_num_threads(NUMA)
  !$OMP PARALLEL DEFAULT(SHARED) PRIVATE(i,first,last) REDUCTION(+:bbc)   
    !$OMP DO SCHEDULE(STATIC)
    do i=1,NUMA
      first=(i-1)*NUMAsize+1
       call omp_set_num_threads(ThreadsPrNUMA)
       call dgemm('N','N',blocksize,blocksize,NUMAsize,1.d0,bA(1,first),dim,bB(first,1),dim,0.d0,bbC,blocksize)
    end do
    !$OMP END DO
    !$OMP END PARALLEL

Apologies if I double post, I

TimP — Tue, 10 Mar 2015 13:39:32 GMT

Apologies if I double post, I was logged out while posting.

Up to my retirement, only certain oem versions of Windows supported as many as 64 threads shared memory. A normal upgrade would lose that capability.

You might get some use from mkl_dynamic and mkl_num_threads, maybe omp_nested.

your inner omp_set_num_threads would take effect in the next parallel region if you didn't override it.

I think the problem is that

Tue_B_ — Wed, 11 Mar 2015 12:24:15 GMT

I think the problem is that dgemm automatically detects whether it is already in a parallel region, and if it is, it refuses to use additional threads, even when suggested.

The following in the basic idea for my matrix multiplication, but the major problem at the moment is that the dgemms each only run on 1 thread even though I just used omp_set_num_threads.

if((numaNodeCount==8).and.(logicalProcessorCount==64)) then !Beck server

        success = SETENVQQ("OMP_PLACES={0,1,2,3,4,5,6,7},{8,9,10,11,12,13,14,15},{16,17,18,19,20,21,22,23},{24,25,26,27,28,29,30,31},{32,33,34,35,36,37,38,39},{40,41,42,43,44,45,46,47},{48,49,50,51,52,53,54,55},{56,57,58,59,60,61,62,63}")
    end if
    call KMP_SET_STACKSIZE_S(990000000)
    blocksize=3000
    NUMA=8
    ThreadsPrNuma=6
    dim=blocksize*Numa
    NumOfBlocks=dim/blocksize
    NUMASize=dim/NUMA
    allocate(bA(dim,dim))
    allocate(bB(dim,dim))
    allocate(bC(dim,dim))
    allocate(bbc(blocksize,blocksize))
    
    call omp_set_num_threads(NUMA)
    !$OMP PARALLEL DEFAULT(SHARED) PRIVATE(i,first,last)
    !$OMP DO SCHEDULE(STATIC)
    do i=1,NUMA
        first=(i-1)*NUMA+1
        last=i*NUMA
        bA(:,first:last)=0.d0
        bB(first:last,:)=0.d0
        bC(:,first:last)=0.d0
    end do
    !$OMP END DO
    !$OMP END PARALLEL    
    call omp_set_dynamic(1)
    call omp_set_nested(1)
    print*,'Nested?',omp_get_nested()
    print*,'Dynamic?',omp_get_dynamic()
    print*, 'Threads?',omp_get_max_threads()

!Setup complete, now we do the matrix multiplication
  Runtimebegin=omp_get_wtime()
  call omp_set_num_threads(NUMA)
  do i=1,NUMofBlocks
    do j=1,NUMofBlocks
      first_row=(i-1)*blocksize+1
      first_col=(j-1)*blocksize+1
      last_row=i*blocksize
      last_col=j*blocksize
      bbc = bc(first_row:last_row,first_col:last_col)   
      !$OMP PARALLEL DEFAULT(SHARED) PRIVATE(k,first,last) proc_bind(spread) REDUCTION(+:bbc)   
      !$OMP DO SCHEDULE(STATIC)
      do k=1,NUMA
        first=(k-1)*NUMAsize+1
        last=k*NUMAsize
        call omp_set_num_threads(ThreadsPrNuma)
        !$OMP PARALLEL DEFAULT(SHARED) PRIVATE(i) proc_bind(master)
        !$OMP END PARALLEL          
        call dgemm('N','N',blocksize,blocksize,NUMAsize,1.d0,bA(first_row:last_row,first:last),dim,bB(first:last,first_col:last_col),dim,0.d0,bbc,blocksize)
      end do
    !$OMP END DO
    !$OMP END PARALLEL
    bc(first_row:last_row,first_col:last_col)=bbc   
    end do
  end do
  Runtimeend=omp_get_wtime()
  print*, 'Matrix multiplication time', runtimeend-runtimebegin

I, perhaps naively, inserted:

        call omp_set_num_threads(ThreadsPrNuma)
        !$OMP PARALLEL DEFAULT(SHARED) PRIVATE(i) proc_bind(master)
        !$OMP END PARALLEL

just before the dgemm call in the hope that it would use the number of threads specified and the affinity defined by proc_bind, but unfortunately to no effect.

Is there any way to do this?