<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic matrix multiplication speedup in Intel® oneAPI Math Kernel Library</title>
    <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/matrix-multiplication-speedup/m-p/1022175#M19730</link>
    <description>&lt;P&gt;Hi,&lt;/P&gt;

&lt;P&gt;I'm using cblas_dgemm for matrix multiplication. For a randomly generated matrix X of size N * N (N could be 100), I calculate Y = X^T * X (X^T is the transpose of X). I can do it in two ways: (1) using cblas_dgemm to calculate Y directly, or (2) using a for loop, for i = 1:N, Y += X[i] * X[i]^T, where X[i] is the i-th column of X.&lt;/P&gt;

&lt;P&gt;Theoretically, both should have the same complexity of N^3. But in practice, way (2) can take 4 times longer than (1). Could you help me understand this?&lt;/P&gt;

&lt;P&gt;Thanks&lt;/P&gt;</description>
    <pubDate>Fri, 12 Dec 2014 20:03:39 GMT</pubDate>
    <dc:creator>Bowen_M_</dc:creator>
    <dc:date>2014-12-12T20:03:39Z</dc:date>
    <item>
      <title>matrix multiplication speedup</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/matrix-multiplication-speedup/m-p/1022175#M19730</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;

&lt;P&gt;I'm using cblas_dgemm for matrix multiplication. For a randomly generated matrix X of size N * N (N could be 100), I calculate Y = X^T * X (X^T is the transpose of X). I can do it in two ways: (1) using cblas_dgemm to calculate Y directly, or (2) using a for loop, for i = 1:N, Y += X[i] * X[i]^T, where X[i] is the i-th column of X.&lt;/P&gt;

&lt;P&gt;Theoretically, both should have the same complexity of N^3. But in practice, way (2) can take 4 times longer than (1). Could you help me understand this?&lt;/P&gt;

&lt;P&gt;Thanks&lt;/P&gt;</description>
      <pubDate>Fri, 12 Dec 2014 20:03:39 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/matrix-multiplication-speedup/m-p/1022175#M19730</guid>
      <dc:creator>Bowen_M_</dc:creator>
      <dc:date>2014-12-12T20:03:39Z</dc:date>
    </item>
    <item>
      <title>In the first case of blas</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/matrix-multiplication-speedup/m-p/1022176#M19731</link>
      <description>&lt;P&gt;In the first case, BLAS dgemm uses multiple optimization techniques, including loop reordering, loop unrolling, subdividing into blocks, vectorization, and parallelization. These help to keep frequently used data in cache, reduce branch instructions, and utilize DLP (data-level parallelism) and TLP (thread-level parallelism). Many other optimizations are also done in various MKL routines.&lt;/P&gt;
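
&lt;P&gt;A minimal sketch of the cache point above, in plain Python rather than MKL. It is not how dgemm is implemented; the function names are hypothetical and the store counter is only a rough stand-in for memory traffic. It computes Y = X^T * X once with a local accumulator per output element and once as N rank-1 updates, then counts how often Y is written. Note that for Y = X^T * X the rank-1 terms are built from the rows of X; outer products of the columns would sum to X * X^T instead.&lt;/P&gt;

```python
import random


def gram_single_pass(X):
    """Y = X^T * X; each Y[j][k] is accumulated locally and stored once."""
    n = len(X)
    Y = [[0.0] * n for _ in range(n)]
    stores = 0
    for j in range(n):
        for k in range(n):
            acc = 0.0  # running sum lives in a local, not in Y
            for i in range(n):
                acc += X[i][j] * X[i][k]
            Y[j][k] = acc
            stores += 1
    return Y, stores


def gram_rank1_updates(X):
    """Y = X^T * X as N rank-1 updates; every update rewrites all of Y."""
    n = len(X)
    Y = [[0.0] * n for _ in range(n)]
    stores = 0
    for i in range(n):
        row = X[i]  # row i of X supplies the vector for this rank-1 term
        for j in range(n):
            for k in range(n):
                Y[j][k] += row[j] * row[k]
                stores += 1
    return Y, stores


def max_abs_diff(A, B):
    """Largest elementwise absolute difference between two matrices."""
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))


if __name__ == "__main__":
    random.seed(0)
    n = 20  # illustrative size, kept small because this is pure Python
    X = [[random.random() for _ in range(n)] for _ in range(n)]
    Y1, s1 = gram_single_pass(X)
    Y2, s2 = gram_rank1_updates(X)
    print("max difference:", max_abs_diff(Y1, Y2))
    print("stores to Y:", s1, "vs", s2)  # n*n versus n*n*n
```

&lt;P&gt;Both versions do N^3 multiply-adds, but the rank-1 version stores to Y N times as often (100 times as often for N = 100). That extra traffic on Y is one plausible contributor to the slowdown described in the question; the sketch does not measure MKL itself.&lt;/P&gt;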

&lt;P&gt;--Vipin&lt;/P&gt;

</description>
      <pubDate>Mon, 15 Dec 2014 04:10:57 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/matrix-multiplication-speedup/m-p/1022176#M19731</guid>
      <dc:creator>VipinKumar_E_Intel</dc:creator>
      <dc:date>2014-12-15T04:10:57Z</dc:date>
    </item>
    <item>
      <title>Comparison of your code vs.</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/matrix-multiplication-speedup/m-p/1022177#M19732</link>
      <description>&lt;P&gt;Comparison of your code vs. reference BLAS source, and consideration of your compile options (and choice of compiler), would also be relevant to understanding these performance questions.&lt;/P&gt;</description>
      <pubDate>Mon, 15 Dec 2014 16:58:06 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/matrix-multiplication-speedup/m-p/1022177#M19732</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2014-12-15T16:58:06Z</dc:date>
    </item>
  </channel>
</rss>

