<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Performance decrease of BLAS function for large matrices in Intel® oneAPI Math Kernel Library</title>
    <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Performance-decrease-of-BLAS-function-for-large-matrices/m-p/798693#M2895</link>
    <description>&lt;P&gt;For large matrices, the speed of RAM is what matters (for Level 2 BLAS).&lt;/P&gt;</description>
    <pubDate>Sun, 10 Jul 2011 18:52:40 GMT</pubDate>
    <dc:creator>yuriisig</dc:creator>
    <dc:date>2011-07-10T18:52:40Z</dc:date>
    <item>
      <title>Performance decrease of BLAS function for large matrices</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Performance-decrease-of-BLAS-function-for-large-matrices/m-p/798689#M2891</link>
      <description>&lt;P lang="en-US"&gt;Hello everyone!&lt;BR /&gt;&lt;BR /&gt;Has anyone got any ideas what
might cause the drop in performance of the BLAS function GEMV() when
compared to a simple serial computation of the same problem?&lt;BR /&gt;&lt;BR /&gt;Let
me explain my question more clearly.&lt;BR /&gt;I've written a program that
compares the performance of GEMV() to a simple serial matrix-vector
multiplication routine. Each routine (serial one and GEMV()) is
called 100000 times and the total time needed for the computations is
recorded in a text file. This is done to simulate a program that uses
an iterative method of finding voltages and currents in an inductive
network.&lt;BR /&gt;With a matrix size of 1000x1000, GEMV() runs
approximately 3.3 times as fast (using 4 cores) as the serial
version.&lt;/P&gt;
&lt;P lang="en-US"&gt;But as the matrix size grows, this speed-up
shrinks considerably.&lt;BR /&gt;For a 1500x1500 matrix, GEMV()
runs ~1.7 times as fast as the serial computation,&lt;BR /&gt;and for a
2000x2000 matrix, GEMV() using 4 cores takes about the same amount of time as the serial computation.
&lt;/P&gt;
&lt;P lang="en-US"&gt;What is causing this behavior? Has it got something
to do with cache, memory access patterns or something completely
different? Any ideas what might be causing this and any suggestions
on how to keep the performance up for large matrices would be greatly
appreciated.&lt;/P&gt;
&lt;P lang="en-US"&gt;Gregor Seitlinger&lt;/P&gt;</description>
      <pubDate>Tue, 28 Jun 2011 16:03:24 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Performance-decrease-of-BLAS-function-for-large-matrices/m-p/798689#M2891</guid>
      <dc:creator>gregor_seitlinger</dc:creator>
      <dc:date>2011-06-28T16:03:24Z</dc:date>
    </item>
    <item>
      <title>Performance decrease of BLAS function for large matrices</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Performance-decrease-of-BLAS-function-for-large-matrices/m-p/798690#M2892</link>
      <description>You left out some significant details. For example, what are the cache sizes? Do you have two levels or three and, if three, is the L3 cache shared by all the cores? Do you compute &lt;I&gt;M v&lt;/I&gt; or &lt;I&gt;M' v&lt;/I&gt; with the call to GEMV?&lt;BR /&gt;&lt;BR /&gt;A 2000 x 2000 dense matrix would occupy 16 or 32 Mbytes (single or double precision), which is probably more than what you have in L2 or L3 cache.&lt;BR /&gt;&lt;BR /&gt;Are you surprised by the timing results because you expected linear speed-up with the number of threads?&lt;BR /&gt;&lt;BR /&gt;Amdahl's "law" has something to say about how much speed-up to expect, not just from parallel programming, but from dedicating more resources in general.&lt;BR /&gt;&lt;BR /&gt;There is an excellent review of the issues in &lt;A href="http://supertech.csail.mit.edu/cilk/minicourse.ps.gz"&gt;A minicourse on multithreaded programming&lt;/A&gt;.</description>
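      <!--
The 16 or 32 Mbytes figure in the post above is simply n * n * bytes per element. A minimal sketch of that arithmetic, assuming decimal megabytes (1 MB = 10**6 bytes), 4-byte single and 8-byte double precision, and the 8 MB cache size quoted later in the thread; the helper names are hypothetical and not part of MKL:

```python
# Footprint of a dense n x n matrix compared with an assumed cache size.
# CACHE_MB is the 8 MB figure quoted later in the thread; adjust for other CPUs.

CACHE_MB = 8.0


def dense_footprint_mb(n, bytes_per_element):
    """Return the size of an n x n dense matrix in decimal megabytes."""
    return n * n * bytes_per_element / 1e6


def fits_in_cache(n, bytes_per_element, cache_mb=CACHE_MB):
    """True when the whole matrix is no larger than the assumed cache."""
    return dense_footprint_mb(n, bytes_per_element) <= cache_mb


if __name__ == "__main__":
    # The three matrix sizes mentioned in the original question.
    for n in (1000, 1500, 2000):
        for label, width in (("single", 4), ("double", 8)):
            size = dense_footprint_mb(n, width)
            fits = fits_in_cache(n, width)
            print(f"{n} x {n} {label}: {size:.1f} MB, fits in cache: {fits}")
```

This only compares raw matrix size with cache capacity; it ignores the vectors and anything else competing for the cache.
      -->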
      <pubDate>Fri, 08 Jul 2011 14:17:44 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Performance-decrease-of-BLAS-function-for-large-matrices/m-p/798690#M2892</guid>
      <dc:creator>mecej4</dc:creator>
      <dc:date>2011-07-08T14:17:44Z</dc:date>
    </item>
    <item>
      <title>Performance decrease of BLAS function for large matrices</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Performance-decrease-of-BLAS-function-for-large-matrices/m-p/798691#M2893</link>
      <description>GEMV is a Level 2 (!) BLAS routine.</description>
      <pubDate>Sun, 10 Jul 2011 15:21:26 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Performance-decrease-of-BLAS-function-for-large-matrices/m-p/798691#M2893</guid>
      <dc:creator>yuriisig</dc:creator>
      <dc:date>2011-07-10T15:21:26Z</dc:date>
    </item>
    <item>
      <title>Performance decrease of BLAS function for large matrices</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Performance-decrease-of-BLAS-function-for-large-matrices/m-p/798692#M2894</link>
      <description>At issue is cache level, not BLAS level.</description>
      <pubDate>Sun, 10 Jul 2011 16:16:39 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Performance-decrease-of-BLAS-function-for-large-matrices/m-p/798692#M2894</guid>
      <dc:creator>mecej4</dc:creator>
      <dc:date>2011-07-10T16:16:39Z</dc:date>
    </item>
    <item>
      <title>Performance decrease of BLAS function for large matrices</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Performance-decrease-of-BLAS-function-for-large-matrices/m-p/798693#M2895</link>
      <description>&lt;P&gt;For large matrices, the speed of RAM is what matters (for Level 2 BLAS).&lt;/P&gt;</description>
      <pubDate>Sun, 10 Jul 2011 18:52:40 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Performance-decrease-of-BLAS-function-for-large-matrices/m-p/798693#M2895</guid>
      <dc:creator>yuriisig</dc:creator>
      <dc:date>2011-07-10T18:52:40Z</dc:date>
    </item>
    <item>
      <title>Performance decrease of BLAS function for large matrices</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Performance-decrease-of-BLAS-function-for-large-matrices/m-p/798694#M2896</link>
      <description>Thanks for all your replies!&lt;BR /&gt;&lt;BR /&gt;It turns out, of course, that it is a cache issue: the larger matrices no longer fit into the L2 cache (8 MB), and that is the reason for the drop in speed.&lt;BR /&gt;&lt;BR /&gt;I guess my best option is to use some sort of divide-and-conquer approach to get some performance back.&lt;BR /&gt;Hopefully I can block the matrix without so much overhead that the performance I gain from working on smaller matrices is eaten up by the blocking code.&lt;BR /&gt;&lt;BR /&gt;Gregor Seitlinger</description>
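      <!--
The blocking idea mentioned above could look roughly like the following. This is a plain-Python sketch for illustration only, not MKL code and not the poster's implementation; the function names and tile sizes are hypothetical. Whether tiling recovers any speed is not established here: in a matrix-vector product each matrix element is read exactly once, so tiling can only improve reuse of the vectors, and the matrix itself still has to stream from memory.

```python
def matvec(a, x):
    """Plain row-by-row product y = A x for a list-of-lists matrix."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in a]


def blocked_matvec(a, x, block_rows, block_cols):
    """Compute y = A x one (block_rows x block_cols) tile at a time."""
    n_rows, n_cols = len(a), len(x)
    y = [0.0] * n_rows
    for r0 in range(0, n_rows, block_rows):
        r1 = min(r0 + block_rows, n_rows)
        for c0 in range(0, n_cols, block_cols):
            c1 = min(c0 + block_cols, n_cols)
            x_tile = x[c0:c1]  # this slice of x is reused by every row of the tile
            for i in range(r0, r1):
                row_tile = a[i][c0:c1]
                # Accumulate the partial dot product for this tile into y[i].
                y[i] += sum(a_ij * x_j for a_ij, x_j in zip(row_tile, x_tile))
    return y


if __name__ == "__main__":
    # Small integer-valued example so both versions agree exactly.
    a = [[float(i + j) for j in range(5)] for i in range(4)]
    x = [1.0, 2.0, 3.0, 4.0, 5.0]
    print(blocked_matvec(a, x, block_rows=2, block_cols=3))
    print(matvec(a, x))
```

In a real program each tile would more likely be handed to a library GEMV call than processed by Python loops; the loop structure is the point of the sketch.
      -->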
      <pubDate>Sun, 10 Jul 2011 19:53:05 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Performance-decrease-of-BLAS-function-for-large-matrices/m-p/798694#M2896</guid>
      <dc:creator>gregor_seitlinger</dc:creator>
      <dc:date>2011-07-10T19:53:05Z</dc:date>
    </item>
    <item>
      <title>Performance decrease of BLAS function for large matrices</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Performance-decrease-of-BLAS-function-for-large-matrices/m-p/798695#M2897</link>
      <description>You may be able to gain more headroom if the large matrices are, say, only 30 percent full and you use sparse matrix techniques, for which there is support in MKL. Even though the coding is somewhat harder, as is the debugging, in the end you will be able to fit (conceptually) larger matrices into cache.</description>
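      <!--
To put a rough number on the headroom suggested above, the sketch below compares dense storage with compressed sparse row (CSR) storage for a 2000 x 2000 matrix that is 30 percent full. CSR is used here only as one common sparse layout; the 8-byte values and 4-byte indices are assumptions, and the helper names are hypothetical.

```python
def dense_bytes(n_rows, n_cols, value_bytes=8):
    """Bytes needed to store every element of a dense matrix."""
    return n_rows * n_cols * value_bytes


def csr_bytes(n_rows, nnz, value_bytes=8, index_bytes=4):
    """Bytes for CSR: one value and one column index per nonzero,
    plus n_rows + 1 row pointers."""
    return nnz * (value_bytes + index_bytes) + (n_rows + 1) * index_bytes


if __name__ == "__main__":
    n = 2000
    nnz = n * n * 30 // 100  # 30 percent of the entries are nonzero
    print(f"dense: {dense_bytes(n, n) / 1e6:.1f} MB")
    print(f"CSR:   {csr_bytes(n, nnz) / 1e6:.1f} MB")
```

Under these assumptions the sparse form needs less than half the dense storage, but it is still larger than the 8 MB cache mentioned earlier in the thread, so the benefit depends on how sparse the matrices really are.
      -->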
      <pubDate>Sun, 10 Jul 2011 23:50:15 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Performance-decrease-of-BLAS-function-for-large-matrices/m-p/798695#M2897</guid>
      <dc:creator>mecej4</dc:creator>
      <dc:date>2011-07-10T23:50:15Z</dc:date>
    </item>
  </channel>
</rss>

