<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic For example, in case of an in Intel® oneAPI Math Kernel Library</title>
    <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/minimum-optimal-block-size-for-ScaLAPACK-and-BLAS/m-p/954492#M15415</link>
    <description>For example, in case of an Ivy Bridge system, like:

Intel Core i7-3840QM ( 2.80 GHz )
Ivy Bridge / 4 cores / 8 logical CPUs / ark.intel.com/compare/70846

Size of L3 Cache = 8MB   ( shared between all cores for data &amp;amp; instructions )
Size of L2 Cache = 1MB   ( 256KB per core / shared for data &amp;amp; instructions )
Size of L1 Cache = 256KB ( 32KB per core for data &amp;amp; 32KB per core for instructions )

an optimal size depends on the sizes of these caches, so you need to take the Lx cache sizes of your system into account.

Also, a recent post reported that on a Haswell system the minimal block size for some memory-bound processing, such as copying from memory location A to location B, is &lt;STRONG&gt;1920 bytes&lt;/STRONG&gt; ( 64 * 30 ), a value that was selected after a series of tests.</description>
    <pubDate>Fri, 19 Jul 2013 13:07:19 GMT</pubDate>
    <dc:creator>SergeyKostrov</dc:creator>
    <dc:date>2013-07-19T13:07:19Z</dc:date>
    <item>
      <title>minimum / optimal block size for ScaLAPACK and BLAS?</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/minimum-optimal-block-size-for-ScaLAPACK-and-BLAS/m-p/954490#M15413</link>
      <description>&lt;P&gt;ScaLAPACK arrays are distributed in a block-cyclic fashion over the process "grid".&amp;nbsp; ScaLAPACK then uses the PBLAS and BLACS to perform BLAS-like operations, but in a distributed SPMD fashion, which become a mix of communication between processes, and BLAS operations within the processes, more-or-less.&lt;/P&gt;
&lt;P&gt;So the size of the block is going to affect the performance of both the communication and the BLAS calls, but the degree to which it does depends on the implementation.&amp;nbsp; The MKL implementation is a black box to the end user (me).&amp;nbsp; And I don't have an ATLAS-like search tool to point me toward the block size I should be using, especially when the parameters are things like { Gigabit Ethernet vs 10G InfiniBand vs ....} and { Westmere vs Sandy/Ivy Bridge vs Haswell } etc.&lt;/P&gt;
&lt;P&gt;So... are there any guidelines for choice of block size when using MKL ScaLAPACK, LAPACK, and BLAS ?&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;E.g. is it important for the ScaLAPACK block size to be a cache-friendly size (e.g. no larger than 1/2 of L1 or L2, etc)?&lt;/LI&gt;
&lt;LI&gt;Or alternatively, does the ScaLAPACK block size primarily affect the load balancing as an operation that works on successively smaller areas of a matrix, as many of the algorithms do?&amp;nbsp; But is not relevant to the efficiency of block-matrix multiplies at the BLAS level?&lt;/LI&gt;
&lt;LI&gt;Perhaps the MKL Level-3 BLAS calls are themselves made less sensitive to large block sizes?&amp;nbsp; (E.g. because there is re-blocking within gemm() etc. anyway, where threads are exploited by OpenMP etc ... maybe the MKL BLAS is already subdividing (re-blocking) to be as efficient as it can, given that it gets a large enough block?)&lt;/LI&gt;
&lt;LI&gt;If I want to avoid such hypothesized re-blocking, because for some reason there are places where I can manage this block size "for free" as a side-effect of the way my code is structured, is there an optimal block size for MKL level-3 BLAS calls?&lt;/LI&gt;
&lt;/UL&gt;
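&lt;P&gt;For illustration, the block-cyclic mapping described above can be sketched in one dimension as follows. This is only a minimal sketch, not MKL or ScaLAPACK code: the function names owner_and_local and rows_per_process are hypothetical, indices are 0-based, and the first block is assumed to live on process 0.&lt;/P&gt;

```python
def owner_and_local(g, nb, nprocs):
    """Map a 0-based global index g to (owning process, 0-based local index).

    nb is the block size and nprocs the number of processes along this
    grid dimension. Blocks are dealt out round-robin starting at process 0.
    """
    block = g // nb                  # which global block the index falls in
    owner = block % nprocs           # blocks are assigned cyclically
    local_block = block // nprocs    # position of that block on its owner
    return owner, local_block * nb + g % nb


def rows_per_process(start, n, nb, nprocs):
    """Count how many of the global rows start .. n-1 each process owns."""
    counts = [0] * nprocs
    for g in range(start, n):
        owner, _ = owner_and_local(g, nb, nprocs)
        counts[owner] += 1
    return counts


if __name__ == "__main__":
    n, nprocs = 1024, 4
    # Look at the trailing half of the rows, as a factorization would
    # once it has worked through the first half of the matrix.
    for nb in (16, 64, 256):
        print(nb, rows_per_process(n // 2, n, nb, nprocs))
```

&lt;P&gt;With n = 1024 and 4 processes, the trailing half of the rows is spread evenly for nb = 16 and nb = 64, but for nb = 256 it sits entirely on two of the four processes. That is the load-balancing side of the trade-off raised in the second bullet; it says nothing about how MKL blocks internally.&lt;/P&gt;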
</description>
      <pubDate>Fri, 19 Jul 2013 04:01:33 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/minimum-optimal-block-size-for-ScaLAPACK-and-BLAS/m-p/954490#M15413</guid>
      <dc:creator>Jas_Mcqueston</dc:creator>
      <dc:date>2013-07-19T04:01:33Z</dc:date>
    </item>
    <item>
      <title>Sorry for some sloppiness in</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/minimum-optimal-block-size-for-ScaLAPACK-and-BLAS/m-p/954491#M15414</link>
      <description>&lt;P&gt;Sorry for some sloppiness in my writing.&amp;nbsp; If I knew how to edit my post, I would make these changes:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;[in bullet 2] "load balancing as an operation" -&amp;gt; &lt;STRONG&gt;of &lt;/STRONG&gt;an operation&lt;/LI&gt;
&lt;LI&gt;[in bullet 3] "because there is ... OpenMP etc" -&amp;gt; because &lt;STRONG&gt;maybe &lt;/STRONG&gt;there is ... OpenMP etc&lt;STRONG&gt;?&lt;/STRONG&gt;&lt;/LI&gt;
&lt;LI&gt;[in bullet 4] "because for some reason" -&amp;gt; because &lt;STRONG&gt;if &lt;/STRONG&gt;for some reason&lt;/LI&gt;
&lt;/UL&gt;</description>
      <pubDate>Fri, 19 Jul 2013 05:43:55 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/minimum-optimal-block-size-for-ScaLAPACK-and-BLAS/m-p/954491#M15414</guid>
      <dc:creator>Jas_Mcqueston</dc:creator>
      <dc:date>2013-07-19T05:43:55Z</dc:date>
    </item>
    <item>
      <title>For example, in case of an</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/minimum-optimal-block-size-for-ScaLAPACK-and-BLAS/m-p/954492#M15415</link>
      <description>For example, in case of an Ivy Bridge system, like:

Intel Core i7-3840QM ( 2.80 GHz )
Ivy Bridge / 4 cores / 8 logical CPUs / ark.intel.com/compare/70846

Size of L3 Cache = 8MB   ( shared between all cores for data &amp;amp; instructions )
Size of L2 Cache = 1MB   ( 256KB per core / shared for data &amp;amp; instructions )
Size of L1 Cache = 256KB ( 32KB per core for data &amp;amp; 32KB per core for instructions )

an optimal size depends on the sizes of these caches, so you need to take the Lx cache sizes of your system into account.
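
As a rough sketch of how the Lx sizes could be taken into account, the snippet below estimates the largest square tile of doubles that fits in part of each cache level listed above. The function name max_square_block, the choice of three tiles (one each for A, B and C in a matrix multiply), 8-byte elements and the one-half fill factor are all assumptions made for illustration. None of this is documented MKL behaviour, and real tuning still needs measurements.

```python
import math


def max_square_block(cache_bytes, elem_bytes=8, n_tiles=3, fraction=0.5):
    """Largest b such that n_tiles tiles of b-by-b elements fit in
    the given fraction of a cache of cache_bytes bytes.

    All default values are illustrative assumptions, not MKL settings.
    """
    budget = int(cache_bytes * fraction)          # bytes we allow ourselves
    elems_per_tile = budget // (n_tiles * elem_bytes)
    return math.isqrt(elems_per_tile)             # side length of one tile


if __name__ == "__main__":
    KB = 1024
    # Per-core L1 data and L2 sizes, and the shared L3, of the system above.
    caches = {"L1 data": 32 * KB, "L2": 256 * KB, "L3": 8 * 1024 * KB}
    for name, size in caches.items():
        print(name, max_square_block(size))
```

Under these assumptions the tile side comes out at 26 for the 32KB L1 data cache, 73 for the 256KB L2 and 418 for the 8MB L3, so the "right" block size differs by more than an order of magnitude depending on which cache level you target.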

Also, a recent post reported that on a Haswell system the minimal block size for some memory-bound processing, such as copying from memory location A to location B, is &lt;STRONG&gt;1920 bytes&lt;/STRONG&gt; ( 64 * 30 ), a value that was selected after a series of tests.</description>
      <pubDate>Fri, 19 Jul 2013 13:07:19 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/minimum-optimal-block-size-for-ScaLAPACK-and-BLAS/m-p/954492#M15415</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2013-07-19T13:07:19Z</dc:date>
    </item>
  </channel>
</rss>

