<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic GEMM and First Touch Memory Allocation in Intel® Moderncode for Parallel Architectures</title>
    <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/GEMM-and-First-Touch-Memory-Allocation/m-p/1126885#M7633</link>
    <description>Forum thread: experiments with a Fortran OpenMP implementation of GEMM (C &amp;lt;- alpha*AB + beta*C) on a Sandy Bridge Xeon node, asking why first-touch NUMA placement of the matrices had little measurable performance impact.</description>
    <pubDate>Mon, 22 Jan 2018 18:48:48 GMT</pubDate>
    <dc:creator>Groth__Brandon</dc:creator>
    <dc:date>2018-01-22T18:48:48Z</dc:date>
    <item>
      <title>GEMM and First Touch Memory Allocation</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/GEMM-and-First-Touch-Memory-Allocation/m-p/1126885#M7633</link>
      <description>&lt;P&gt;&lt;SPAN style="font-size: 1em;"&gt;I am doing some experiments with various implementations of the GEMM operation C &amp;lt;- alpha*AB + beta*C. One question that came up is whether memory locality has any measurable impact on performance on a Sandy Bridge Xeon node. The following is one implementation of GEMM (I have many others, including MKL's SGEMM(...)):&lt;/SPAN&gt;&lt;/P&gt;

&lt;PRE class="brush:fortran;"&gt;call RANDOM_NUMBER(A)
call RANDOM_NUMBER(B)
call RANDOM_NUMBER(C)

!$OMP PARALLEL DO SHARED(A,B,C) PRIVATE(i,j,l,sum)
do j = 1, n
  do i = 1, m
    sum = 0.0
    do l = 1, k
      sum = sum + A(i,l)*B(l,j)
    enddo
    C(i,j) = alpha*sum + beta*C(i,j)
  enddo
enddo
!$OMP END PARALLEL DO&lt;/PRE&gt;

&lt;P&gt;This code uses stride-1 memory access where possible, but that clearly isn't possible for matrix A. Because RANDOM_NUMBER is called outside any parallel region, all three matrices are first-touched by thread 0 and therefore placed in its NUMA node's memory. This should make memory access quite expensive, especially for threads on the other socket (i.e., threads 8-15 when running 16 threads).&lt;/P&gt;

&lt;P&gt;Some observations about how the loop works:&lt;/P&gt;

&lt;P&gt;1) Matrix A is needed in full by every thread, since it has no dependence on loop j (i.e., the loop being distributed across threads)&lt;/P&gt;

&lt;P&gt;2) Each thread needs only the columns of B and C that correspond to its iterations of loop j&lt;/P&gt;

&lt;P&gt;A potential solution using first touch with static scheduling could be this:&lt;/P&gt;

&lt;PRE class="brush:fortran;"&gt;!$OMP PARALLEL SHARED(A,B,C,m,n) PRIVATE(i,j)

! Matrix A is needed on every thread, so we spread out rows of A evenly to all threads
!$OMP DO
do i = 1, m
    call RANDOM_NUMBER( A(i,:) )
end do
!$OMP END DO

! Only columns of B and C are needed on each thread
!$OMP DO
do j = 1, n
  call RANDOM_NUMBER( B(:,j) )
  call RANDOM_NUMBER( C(:,j) ) 
end do
!$OMP END DO

!$OMP END PARALLEL&lt;/PRE&gt;

&lt;P&gt;... Continue algorithm ...&lt;/P&gt;
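
&lt;P&gt;For the first touch to line up with the computation, the compute loop should use the same static schedule over j, so each thread works on the columns of B and C that it initialized. A sketch (the same loop as above, with the schedule made explicit):&lt;/P&gt;

&lt;PRE class="brush:fortran;"&gt;!$OMP PARALLEL DO SHARED(A,B,C) PRIVATE(i,j,l,sum) SCHEDULE(STATIC)
do j = 1, n            ! same static distribution as the initialization loop
  do i = 1, m
    sum = 0.0
    do l = 1, k
      sum = sum + A(i,l)*B(l,j)
    enddo
    C(i,j) = alpha*sum + beta*C(i,j)
  enddo
enddo
!$OMP END PARALLEL DO&lt;/PRE&gt;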

&lt;P&gt;Giving the threads near-perfect locality for two of the three matrices hardly had any performance impact: I observed only a 1-2% difference in execution times, which is not conclusive evidence that the first-touch allocation is doing anything. Even when we tried the same thing with a pre-transposed copy of A, to get stride-1 access in the inner loop:&lt;/P&gt;

&lt;PRE class="brush:fortran;"&gt;Atr = TRANSPOSE(A)   ! now use Atr(l,i) instead of A(i,l) in the do loop&lt;/PRE&gt;

&lt;P&gt;we got similar percentage differences in run times. The experiments used square matrices A, B, and C of size 5000x5000.&lt;/P&gt;
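
&lt;P&gt;To confirm where the pages actually landed, one can inspect the per-node memory of the running process (assuming Linux with the numactl package installed; the binary name here is hypothetical):&lt;/P&gt;

&lt;PRE class="brush:bash;"&gt;./gemm_test &amp;amp;      # hypothetical test binary, run in the background
numastat -p $!        # per-NUMA-node memory usage of that process&lt;/PRE&gt;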

&lt;P&gt;Does anyone have any insight as to what can be done?&lt;/P&gt;</description>
      <pubDate>Mon, 22 Jan 2018 18:48:48 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/GEMM-and-First-Touch-Memory-Allocation/m-p/1126885#M7633</guid>
      <dc:creator>Groth__Brandon</dc:creator>
      <dc:date>2018-01-22T18:48:48Z</dc:date>
    </item>
    <item>
      <title>You're probably getting</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/GEMM-and-First-Touch-Memory-Allocation/m-p/1126886#M7634</link>
      <description>&lt;P&gt;You're probably getting fairly effective use of cache, so it's not surprising that first touch doesn't have much effect. It may be difficult to get more than 50% of peak performance with straightforward source code, and first touch will have less relative effect if you don't use an optimized library version. You say it's clear you aren't getting a stride-1 inner loop, but normal compiler loop interchange can produce one, so you aren't actually preventing stride-1 access unless you set compiler options for that purpose.&lt;/P&gt;

&lt;P&gt;Because blocking strategies depend on matrix size, even an expert can't guess what you should do without that information. Even MKL now expects you to choose a different implementation for small matrices than the one for medium and large sizes, which selects blocking strategies according to problem size.&lt;/P&gt;
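
&lt;P&gt;As an illustration of blocking only, not a tuned implementation (the block size bs is an arbitrary choice, and C is assumed to have been pre-scaled by beta beforehand), the loop nest above could be tiled like this:&lt;/P&gt;

&lt;PRE class="brush:fortran;"&gt;! Illustrative tiling sketch; bs is untuned, and C has been pre-scaled by beta.
bs = 64
!$OMP PARALLEL DO SHARED(A,B,C) PRIVATE(i,j,l,jj,ll,sum) SCHEDULE(STATIC)
do jj = 1, n, bs
  do ll = 1, k, bs
    do j = jj, min(jj+bs-1, n)
      do i = 1, m
        sum = 0.0
        do l = ll, min(ll+bs-1, k)
          sum = sum + A(i,l)*B(l,j)
        enddo
        C(i,j) = C(i,j) + alpha*sum   ! accumulate partial block products
      enddo
    enddo
  enddo
enddo
!$OMP END PARALLEL DO&lt;/PRE&gt;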

&lt;P&gt;The "modern code" strategy expects you to use appropriate libraries rather than spend your time reinventing wheels.&lt;/P&gt;</description>
      <pubDate>Mon, 22 Jan 2018 23:13:57 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/GEMM-and-First-Touch-Memory-Allocation/m-p/1126886#M7634</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2018-01-22T23:13:57Z</dc:date>
    </item>
    <item>
      <title>I have not seen much impact</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/GEMM-and-First-Touch-Memory-Allocation/m-p/1126887#M7635</link>
      <description>&lt;P&gt;I have not seen much impact from poor first-touch affinity on Sandy Bridge, but I have definitely seen it on Haswell and Skylake Xeon.&lt;/P&gt;

&lt;P&gt;On Linux, I have found the following command to be adequate to obtain good scaling on Haswell and Skylake Xeon:&lt;/P&gt;

&lt;BLOCKQUOTE&gt;
	&lt;P&gt;numactl --interleave=0,1 ./dgemm_test 5000 5000 5000&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
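
&lt;P&gt;Note that first-touch placement only helps if the threads stay where they first touched the data, so binding threads is a prerequisite. A sketch using the standard OpenMP 4.0 affinity environment variables:&lt;/P&gt;

&lt;PRE class="brush:bash;"&gt;# Pin threads so first-touch placement is stable across the run
export OMP_NUM_THREADS=16
export OMP_PLACES=cores
export OMP_PROC_BIND=spread
./dgemm_test 5000 5000 5000&lt;/PRE&gt;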

&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 23 Jan 2018 23:45:32 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/GEMM-and-First-Touch-Memory-Allocation/m-p/1126887#M7635</guid>
      <dc:creator>McCalpinJohn</dc:creator>
      <dc:date>2018-01-23T23:45:32Z</dc:date>
    </item>
  </channel>
</rss>

