<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic optimizing dgemm on NUMA systems in Intel® oneAPI Math Kernel Library</title>
    <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/optimizing-dgemm-on-NUMA-systems/m-p/1034107#M20306</link>
    <description>&lt;P&gt;Hello&lt;/P&gt;

&lt;P&gt;I have been trying to optimize matrix multiplication on NUMA systems but so far without much luck.&lt;/P&gt;

&lt;P&gt;I have played around with the dgemm routine and first touch.&lt;/P&gt;

&lt;P&gt;A snippet of my code looks like this:&lt;/P&gt;

&lt;PRE class="brush:fortran;"&gt;     print*, 'initiating first touch'
    A=0
    B=0
    C=0
    !$OMP PARALLEL DEFAULT(SHARED) PRIVATE(i)
    !$OMP DO SCHEDULE(STATIC)
    do i=1,dim
        rA(:,i)=0.d0
        rB(:,i)=0.d0
        rC(:,i)=0.d0
        cA(i,:)=0.d0
        cB(i,:)=0.d0
        cC(i,:)=0.d0
    end do
    !$OMP END DO
    !$OMP END PARALLEL
    
    print*, 'first touch done'
      
    Runtimebegin=omp_get_wtime()
    do i=1,10
    call dgemm('N','N',dim,dim,dim,1.d0,A,dim,B,dim,0.d0,C,dim)
    end do
    Runtimeend=omp_get_wtime()

    print*, 'ABC', runtimeend-runtimebegin

    call amat(rA,y,z,n,m,mm) 
    
    Runtimebegin=omp_get_wtime()
    do i=1,10
    call dgemm('N','N',dim,dim,dim,1.d0,cA,dim,cB,dim,0.d0,cC,dim)
    end do
    Runtimeend=omp_get_wtime()

    print*, 'cAcBcC', runtimeend-runtimebegin
    
    Runtimebegin=omp_get_wtime()
    do i=1,10
    call dgemm('N','N',dim,dim,dim,1.d0,rA,dim,rB,dim,0.d0,rC,dim)
    end do
    Runtimeend=omp_get_wtime()

    print*, 'rArBrC', runtimeend-runtimebegin
    
    
    Runtimebegin=omp_get_wtime()
    do i=1,10
    call dgemm('N','N',dim,dim,dim,1.d0,cA,dim,rB,dim,0.d0,cC,dim)
    end do
    Runtimeend=omp_get_wtime()

    print*, 'cArBcC', runtimeend-runtimebegin
   &lt;/PRE&gt;

&lt;P&gt;From which I get the following results on NUMA architechture.&lt;/P&gt;

&lt;P&gt;&amp;nbsp;ABC &amp;nbsp; 37.0858174243185&lt;BR /&gt;
	&amp;nbsp;cAcBcC &amp;nbsp; 9.20657384615333&lt;BR /&gt;
	&amp;nbsp;rArBrC &amp;nbsp; 9.42917347316688&lt;BR /&gt;
	&amp;nbsp;cArBcC &amp;nbsp; 9.22702269622823&lt;/P&gt;

&lt;P&gt;As we can see, it is important to spread your data across the NUMA system, but how it is spread seems to have little influence, or at least I haven't found an optimal way to spread it.&lt;/P&gt;

&lt;P&gt;So I ask, is there an optimal way to spread my matrices when using dgemm?&lt;/P&gt;

&lt;P&gt;Or if dgemm is not suitable on NUMA, are there any alternatives which handle matrix multiplication faster on NUMA architechture?&lt;/P&gt;

&lt;P&gt;Thanks in advance&lt;/P&gt;

&lt;P&gt;Tue&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
    <pubDate>Wed, 04 Mar 2015 14:10:48 GMT</pubDate>
    <dc:creator>Tue_B_</dc:creator>
    <dc:date>2015-03-04T14:10:48Z</dc:date>
    <item>
      <title>optimizing dgemm on NUMA systems</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/optimizing-dgemm-on-NUMA-systems/m-p/1034107#M20306</link>
      <description>&lt;P&gt;Hello&lt;/P&gt;

&lt;P&gt;I have been trying to optimize matrix multiplication on NUMA systems but so far without much luck.&lt;/P&gt;

&lt;P&gt;I have played around with the dgemm routine and first touch.&lt;/P&gt;

&lt;P&gt;A snippet of my code looks like this:&lt;/P&gt;

&lt;PRE class="brush:fortran;"&gt;     print*, 'initiating first touch'
    A=0
    B=0
    C=0
    !$OMP PARALLEL DEFAULT(SHARED) PRIVATE(i)
    !$OMP DO SCHEDULE(STATIC)
    do i=1,dim
        rA(:,i)=0.d0
        rB(:,i)=0.d0
        rC(:,i)=0.d0
        cA(i,:)=0.d0
        cB(i,:)=0.d0
        cC(i,:)=0.d0
    end do
    !$OMP END DO
    !$OMP END PARALLEL
    
    print*, 'first touch done'
      
    Runtimebegin=omp_get_wtime()
    do i=1,10
    call dgemm('N','N',dim,dim,dim,1.d0,A,dim,B,dim,0.d0,C,dim)
    end do
    Runtimeend=omp_get_wtime()

    print*, 'ABC', runtimeend-runtimebegin

    call amat(rA,y,z,n,m,mm) 
    
    Runtimebegin=omp_get_wtime()
    do i=1,10
    call dgemm('N','N',dim,dim,dim,1.d0,cA,dim,cB,dim,0.d0,cC,dim)
    end do
    Runtimeend=omp_get_wtime()

    print*, 'cAcBcC', runtimeend-runtimebegin
    
    Runtimebegin=omp_get_wtime()
    do i=1,10
    call dgemm('N','N',dim,dim,dim,1.d0,rA,dim,rB,dim,0.d0,rC,dim)
    end do
    Runtimeend=omp_get_wtime()

    print*, 'rArBrC', runtimeend-runtimebegin
    
    
    Runtimebegin=omp_get_wtime()
    do i=1,10
    call dgemm('N','N',dim,dim,dim,1.d0,cA,dim,rB,dim,0.d0,cC,dim)
    end do
    Runtimeend=omp_get_wtime()

    print*, 'cArBcC', runtimeend-runtimebegin
   &lt;/PRE&gt;

&lt;P&gt;From which I get the following results on NUMA architechture.&lt;/P&gt;

&lt;P&gt;&amp;nbsp;ABC &amp;nbsp; 37.0858174243185&lt;BR /&gt;
	&amp;nbsp;cAcBcC &amp;nbsp; 9.20657384615333&lt;BR /&gt;
	&amp;nbsp;rArBrC &amp;nbsp; 9.42917347316688&lt;BR /&gt;
	&amp;nbsp;cArBcC &amp;nbsp; 9.22702269622823&lt;/P&gt;

&lt;P&gt;As we can see, it is important to spread your data across the NUMA system, but how it is spread seems to have little influence, or at least I haven't found an optimal way to spread it.&lt;/P&gt;

&lt;P&gt;So I ask, is there an optimal way to spread my matrices when using dgemm?&lt;/P&gt;

&lt;P&gt;Or if dgemm is not suitable on NUMA, are there any alternatives which handle matrix multiplication faster on NUMA architechture?&lt;/P&gt;

&lt;P&gt;Thanks in advance&lt;/P&gt;

&lt;P&gt;Tue&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Wed, 04 Mar 2015 14:10:48 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/optimizing-dgemm-on-NUMA-systems/m-p/1034107#M20306</guid>
      <dc:creator>Tue_B_</dc:creator>
      <dc:date>2015-03-04T14:10:48Z</dc:date>
    </item>
    <item>
      <title>Your stride 1 array</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/optimizing-dgemm-on-NUMA-systems/m-p/1034108#M20307</link>
      <description>&lt;P&gt;Your stride 1 array initialization ought to engage first touch locality optimization if you would set appropriate affinity by omp_proc_bind or kMP_affinity. &amp;nbsp;I doubt it can be done for large stride.&lt;/P&gt;

&lt;P&gt;mkL dgemm is written so as to minimize memory references so it may not depend strongly on locality. &amp;nbsp;It has been demonstrated efficient on dual CPU.&lt;/P&gt;</description>
      <pubDate>Wed, 04 Mar 2015 15:39:25 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/optimizing-dgemm-on-NUMA-systems/m-p/1034108#M20307</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2015-03-04T15:39:25Z</dc:date>
    </item>
    <item>
      <title>Hello,</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/optimizing-dgemm-on-NUMA-systems/m-p/1034109#M20308</link>
      <description>&lt;P&gt;Hello,&lt;/P&gt;

&lt;P&gt;&amp;nbsp;Have you tried numactl? You can specify interleaved memory mapping with numactl command.&lt;/P&gt;

&lt;P&gt;$&amp;gt; numactl -i all &amp;lt;your binary&amp;gt;&lt;/P&gt;

&lt;P&gt;Thanks,&lt;/P&gt;

&lt;P&gt;&amp;nbsp;Kazushige Goto&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Fri, 06 Mar 2015 19:19:12 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/optimizing-dgemm-on-NUMA-systems/m-p/1034109#M20308</guid>
      <dc:creator>Kazushige_G_Intel</dc:creator>
      <dc:date>2015-03-06T19:19:12Z</dc:date>
    </item>
    <item>
      <title>Hey</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/optimizing-dgemm-on-NUMA-systems/m-p/1034110#M20309</link>
      <description>&lt;P&gt;&lt;SPAN style="font-size: 12px; line-height: 18px;"&gt;Hey&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 12px; line-height: 18px;"&gt;Thanks for the input Tim, so far my tests have also shown that dgemm isn't dependent on locality.&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN style="font-size: 12px; line-height: 18px;"&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 12px; line-height: 18px;"&gt;numactl sounds like an easy way to control memory mapping, unfortunately I'm on windows and as far as I can see there isn't anything similar on there.&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 12px; line-height: 18px;"&gt;Does anyone know whether it is possible to specify the number of threads dgemm should use?&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 12px; line-height: 18px;"&gt;If I do something like &amp;nbsp;the one below on my 64 core machine, I notice that dgemm still only uses 32 of the cores.&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;

&lt;PRE class="brush:fortran;"&gt;    call omp_set_num_threads(64)    
    Runtimebegin=omp_get_wtime()
    do i=1,100
    call dgemm('N','N',dim,dim,dim,1.d0,cA,dim,cB,dim,0.d0,cC,dim)
    end do
    Runtimeend=omp_get_wtime()
&lt;/PRE&gt;

&lt;P&gt;And even worse is if I do something like the one below here, where NUMA=8 and ThreadsPrNUMA=7, then I see it only uses 8 threads in total, instead of the 56 I would expect.&lt;/P&gt;

&lt;PRE class="brush:fortran;"&gt;
  call omp_set_num_threads(NUMA)
  !$OMP PARALLEL DEFAULT(SHARED) PRIVATE(i,first,last) REDUCTION(+:bbc)   
    !$OMP DO SCHEDULE(STATIC)
    do i=1,NUMA
      first=(i-1)*NUMAsize+1
       call omp_set_num_threads(ThreadsPrNUMA)
       call dgemm('N','N',blocksize,blocksize,NUMAsize,1.d0,bA(1,first),dim,bB(first,1),dim,0.d0,bbC,blocksize)
    end do
    !$OMP END DO
    !$OMP END PARALLEL
&lt;/PRE&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 10 Mar 2015 10:42:32 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/optimizing-dgemm-on-NUMA-systems/m-p/1034110#M20309</guid>
      <dc:creator>Tue_B_</dc:creator>
      <dc:date>2015-03-10T10:42:32Z</dc:date>
    </item>
    <item>
      <title>Apologies if I double post, I</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/optimizing-dgemm-on-NUMA-systems/m-p/1034111#M20310</link>
      <description>&lt;P&gt;Apologies if I double post, I was logged out while posting.&lt;/P&gt;

&lt;P&gt;Up to my retirement, only certain oem versions of Windows supported as many as 64 threads shared memory. A normal upgrade would lose that capability.&lt;/P&gt;

&lt;P&gt;You might get some use from mkl_dynamic and mkl_num_threads, maybe omp_nested.&lt;/P&gt;

&lt;P&gt;your inner omp_set_num_threads would take &amp;nbsp;effect in the next parallel region if you didn't override it.&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 10 Mar 2015 13:39:32 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/optimizing-dgemm-on-NUMA-systems/m-p/1034111#M20310</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2015-03-10T13:39:32Z</dc:date>
    </item>
    <item>
      <title>I think the problem is that</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/optimizing-dgemm-on-NUMA-systems/m-p/1034112#M20311</link>
      <description>&lt;P&gt;I think the problem is that dgemm automatically detects whether it is already in a parallel region, and if it is, it refuses to use additional threads, even when suggested.&lt;/P&gt;

&lt;P&gt;The following in the basic idea for my matrix multiplication, but the major problem at the moment is that the dgemms each only run on 1 thread even though I just used omp_set_num_threads.&amp;nbsp;&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-family: Consolas, 'Lucida Console', Menlo, Monaco, 'DejaVu Sans Mono', monospace, sans-serif; font-size: 1em; line-height: 1.5;"&gt;&amp;nbsp; &amp;nbsp; if((numaNodeCount==8).and.(logicalProcessorCount==64)) then !Beck server&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;PRE class="brush:fortran;"&gt;        success = SETENVQQ("OMP_PLACES={0,1,2,3,4,5,6,7},{8,9,10,11,12,13,14,15},{16,17,18,19,20,21,22,23},{24,25,26,27,28,29,30,31},{32,33,34,35,36,37,38,39},{40,41,42,43,44,45,46,47},{48,49,50,51,52,53,54,55},{56,57,58,59,60,61,62,63}")
    end if
    call KMP_SET_STACKSIZE_S(990000000)
    blocksize=3000
    NUMA=8
    ThreadsPrNuma=6
    dim=blocksize*Numa
    NumOfBlocks=dim/blocksize
    NUMASize=dim/NUMA
    allocate(bA(dim,dim))
    allocate(bB(dim,dim))
    allocate(bC(dim,dim))
    allocate(bbc(blocksize,blocksize))
    
    call omp_set_num_threads(NUMA)
    !$OMP PARALLEL DEFAULT(SHARED) PRIVATE(i,first,last)
    !$OMP DO SCHEDULE(STATIC)
    do i=1,NUMA
        first=(i-1)*NUMA+1
        last=i*NUMA
        bA(:,first:last)=0.d0
        bB(first:last,:)=0.d0
        bC(:,first:last)=0.d0
    end do
    !$OMP END DO
    !$OMP END PARALLEL    
    call omp_set_dynamic(1)
    call omp_set_nested(1)
    print*,'Nested?',omp_get_nested()
    print*,'Dynamic?',omp_get_dynamic()
    print*, 'Threads?',omp_get_max_threads()

!Setup complete, now we do the matrix multiplication
  Runtimebegin=omp_get_wtime()
  call omp_set_num_threads(NUMA)
  do i=1,NUMofBlocks
    do j=1,NUMofBlocks
      first_row=(i-1)*blocksize+1
      first_col=(j-1)*blocksize+1
      last_row=i*blocksize
      last_col=j*blocksize
      bbc = bc(first_row:last_row,first_col:last_col)   
      !$OMP PARALLEL DEFAULT(SHARED) PRIVATE(k,first,last) proc_bind(spread) REDUCTION(+:bbc)   
      !$OMP DO SCHEDULE(STATIC)
      do k=1,NUMA
        first=(k-1)*NUMAsize+1
        last=k*NUMAsize
        call omp_set_num_threads(ThreadsPrNuma)
        !$OMP PARALLEL DEFAULT(SHARED) PRIVATE(i) proc_bind(master)
        !$OMP END PARALLEL          
        call dgemm('N','N',blocksize,blocksize,NUMAsize,1.d0,bA(first_row:last_row,first:last),dim,bB(first:last,first_col:last_col),dim,0.d0,bbc,blocksize)
      end do
    !$OMP END DO
    !$OMP END PARALLEL
    bc(first_row:last_row,first_col:last_col)=bbc   
    end do
  end do
  Runtimeend=omp_get_wtime()
  print*, 'Matrix multiplication time', runtimeend-runtimebegin
&lt;/PRE&gt;

&lt;P&gt;I, perhaps naively, inserted:&lt;/P&gt;

&lt;PRE class="brush:fortran;"&gt;        call omp_set_num_threads(ThreadsPrNuma)
        !$OMP PARALLEL DEFAULT(SHARED) PRIVATE(i) proc_bind(master)
        !$OMP END PARALLEL          
&lt;/PRE&gt;

&lt;P&gt;just before the dgemm call in the hope that it would use the number of threads specified and the affinity defined by proc_bind, but unfortunately to no effect.&lt;/P&gt;

&lt;P&gt;Is there any way to do this?&lt;/P&gt;</description>
      <pubDate>Wed, 11 Mar 2015 12:24:15 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/optimizing-dgemm-on-NUMA-systems/m-p/1034112#M20311</guid>
      <dc:creator>Tue_B_</dc:creator>
      <dc:date>2015-03-11T12:24:15Z</dc:date>
    </item>
  </channel>
</rss>

