<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic chain matrix vector multiplications in Intel® oneAPI Math Kernel Library</title>
    <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/chain-matrix-vector-multiplications/m-p/790414#M2174</link>
    <description>&lt;P&gt;If you insist on using GEMM from BLAS3, one possible strategy is to block for B-by-B blocks of R&lt;SUP&gt;T&lt;/SUP&gt; and m-by-B blocks of R. For all such combinations, take the corresponding B-by-m block from A to form part of the product. (If you use the CSR format to store A, this B-by-m block would be contiguous in memory.) The tricky part is choosing a good B. You probably want to choose B such that (3 * B * B + m * B) * sizeof(float or double) = L2 cache capacity. You may also want to use NTA prefetching to keep A from polluting the cache hierarchy. Finally, you can utilize the L2 caches of multiple cores by threading the algorithm so that each core handles a separate m-by-B block of R. Threading inside BLAS is unlikely to be worthwhile because B is expected to be quite small.&lt;/P&gt;</description>
    <pubDate>Fri, 16 Mar 2012 01:48:05 GMT</pubDate>
    <dc:creator>styc</dc:creator>
    <dc:date>2012-03-16T01:48:05Z</dc:date>
    <item>
      <title>chain matrix vector multiplications</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/chain-matrix-vector-multiplications/m-p/790411#M2171</link>
      <description>Dear all,&lt;BR /&gt;&lt;BR /&gt;I frequently perform chained products of dense block matrices with a sparse matrix; in the mathematical sense, this is a projection.&lt;BR /&gt;&lt;BR /&gt;Say R is a dense m-by-n matrix and A is a sparse or dense m-by-m matrix. I was wondering what the most efficient way is to compute products like&lt;BR /&gt;&lt;BR /&gt;R^T A R&lt;BR /&gt;&lt;BR /&gt;Currently I create a temporary matrix for the product AR and then multiply it by R^T from the left.&lt;BR /&gt;&lt;BR /&gt;Any ideas on how to make this more efficient without a temporary?&lt;BR /&gt;&lt;BR /&gt;Best regards,&lt;BR /&gt;Umut</description>
      <pubDate>Wed, 14 Mar 2012 18:36:18 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/chain-matrix-vector-multiplications/m-p/790411#M2171</guid>
      <dc:creator>utab</dc:creator>
      <dc:date>2012-03-14T18:36:18Z</dc:date>
    </item>
    <item>
      <title>chain matrix vector multiplications</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/chain-matrix-vector-multiplications/m-p/790412#M2172</link>
      <description>You can arrange the computation in several mathematically equivalent ways. The alternatives and their implications for computational efficiency are covered in, for example, &lt;I&gt;Matrix Computations&lt;/I&gt; by Golub and Van Loan, ISBN-13: 978-0801854149.&lt;BR /&gt;&lt;BR /&gt;For example, you can produce one column of the product A R at a time, and multiply that column by R&lt;SUP&gt;T&lt;/SUP&gt; to obtain the corresponding column of R&lt;SUP&gt;T&lt;/SUP&gt;A R. This way only a single work vector of length m is needed instead of the full m-by-n temporary.</description>
      <pubDate>Thu, 15 Mar 2012 09:42:48 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/chain-matrix-vector-multiplications/m-p/790412#M2172</guid>
      <dc:creator>mecej4</dc:creator>
      <dc:date>2012-03-15T09:42:48Z</dc:date>
    </item>
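    <!--
    Editorial sketch, not part of the original thread: the column-at-a-time
    arrangement described in the reply above can be illustrated as follows.
    It assumes dense matrices stored as plain Python lists of rows; the
    function name rtar_by_columns is hypothetical. In practice the two inner
    products would be a sparse or dense matrix-vector product followed by a
    GEMV call, which this pure-Python version only imitates.

```python
def rtar_by_columns(R, A):
    """Compute R^T A R one column at a time.

    R is m-by-n and A is m-by-m, both given as lists of rows.
    Only one work vector of length m is kept, rather than the
    full m-by-n temporary A R.
    """
    m, n = len(R), len(R[0])
    C = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # Column j of A R:  w = A * R[:, j]
        w = [sum(A[i][k] * R[k][j] for k in range(m)) for i in range(m)]
        # Column j of R^T A R:  C[:, j] = R^T * w
        for i in range(n):
            C[i][j] = sum(R[k][i] * w[k] for k in range(m))
    return C
```

    The trade-off is the one raised later in the thread: this version saves
    the temporary but works at the matrix-vector level, so it gives up the
    blocking that makes matrix-matrix kernels fast.
    -->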
    <item>
      <title>chain matrix vector multiplications</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/chain-matrix-vector-multiplications/m-p/790413#M2173</link>
      <description>Well, OK, but block BLAS operations are in general much faster and better optimized for the underlying architecture, so doing one vector at a time might not buy you much. However, I am not completely sure; correct me if I am wrong.</description>
      <pubDate>Thu, 15 Mar 2012 15:38:37 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/chain-matrix-vector-multiplications/m-p/790413#M2173</guid>
      <dc:creator>utab</dc:creator>
      <dc:date>2012-03-15T15:38:37Z</dc:date>
    </item>
    <item>
      <title>chain matrix vector multiplications</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/chain-matrix-vector-multiplications/m-p/790414#M2174</link>
      <description>&lt;P&gt;If you insist on using GEMM from BLAS3, one possible strategy is to block for B-by-B blocks of R&lt;SUP&gt;T&lt;/SUP&gt; and m-by-B blocks of R. For all such combinations, take the corresponding B-by-m block from A to form part of the product. (If you use the CSR format to store A, this B-by-m block would be contiguous in memory.) The tricky part is choosing a good B. You probably want to choose B such that (3 * B * B + m * B) * sizeof(float or double) = L2 cache capacity. You may also want to use NTA prefetching to keep A from polluting the cache hierarchy. Finally, you can utilize the L2 caches of multiple cores by threading the algorithm so that each core handles a separate m-by-B block of R. Threading inside BLAS is unlikely to be worthwhile because B is expected to be quite small.&lt;/P&gt;</description>
      <pubDate>Fri, 16 Mar 2012 01:48:05 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/chain-matrix-vector-multiplications/m-p/790414#M2174</guid>
      <dc:creator>styc</dc:creator>
      <dc:date>2012-03-16T01:48:05Z</dc:date>
    </item>
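    <!--
    Editorial sketch, not part of the original thread: the block-size rule
    of thumb in the reply above, (3 * B * B + m * B) * sizeof(element) = L2
    capacity, is a quadratic in B and can be solved directly. The function
    name block_size and its parameters are hypothetical, and the L2 size
    must be supplied by the caller for the actual target CPU.

```python
import math


def block_size(m, l2_bytes, elem_bytes=8):
    """Largest integer B with (3*B*B + m*B) * elem_bytes within l2_bytes."""
    budget = l2_bytes / elem_bytes  # number of elements that fit in L2
    # Positive root of 3*B^2 + m*B - budget = 0
    b = int((-m + math.sqrt(m * m + 12.0 * budget)) / 6.0)
    # Guard against floating-point rounding pushing b one step too high
    while b > 0 and (3 * b * b + m * b) * elem_bytes > l2_bytes:
        b -= 1
    return b
```

    For example, block_size(m, 256 * 1024) would size B for double precision
    under an assumed 256 KiB L2 cache. For large m the m * B term dominates
    and B becomes very small, which matches the remark that threading inside
    BLAS is unlikely to pay off.
    -->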
  </channel>
</rss>

