<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Hi, in Intel® oneAPI Math Kernel Library</title>
    <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Slow-rectangular-matrix-transposition/m-p/1034955#M20380</link>
    <description>&lt;P&gt;Hi,&lt;/P&gt;

&lt;P&gt;You are right: mkl_?imatcopy is not optimized for non-square cases, since even an optimized version would be much slower than out-of-place transposition. So in such situations we usually either use mkl_?omatcopy or, if it suits the algorithm, a gather-operation-scatter technique (e.g. copy a block of data to a temporary buffer, perform the needed operations, and scatter the data back to its place; this technique allows data to be reused in cache and generally improves performance).&lt;/P&gt;

&lt;P&gt;The square case is well optimized, since that is the case where mkl_?imatcopy can really help.&lt;/P&gt;</description>
    <pubDate>Wed, 22 Apr 2015 07:01:22 GMT</pubDate>
    <dc:creator>Evarist_F_Intel</dc:creator>
    <dc:date>2015-04-22T07:01:22Z</dc:date>
    <item>
      <title>Slow rectangular matrix transposition?</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Slow-rectangular-matrix-transposition/m-p/1034954#M20379</link>
      <description>Hello, 

I'm working with MKL 11.2.0.090 on Gentoo. I have an "Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz" processor. 

I'm trying to speed up my in-place matrix transpositions, and I thought mkl_?imatcopy would be the solution. I get a very good speedup on square matrices, but on rectangular matrices it is much slower than my naive "follow the cycles" implementation. 
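
For readers unfamiliar with the term, a "follow the cycles" in-place transposition typically looks like the minimal sketch below. It is an illustration only, not the implementation that was benchmarked here. The names dest_index and transpose_in_place are hypothetical, and the sketch assumes row-major storage and that rows * rows * cols fits in a long.

```c
/* Position that the element at index k moves to when a row-major
   rows x cols matrix is transposed in place (result: cols x rows). */
static long dest_index(long k, long rows, long cols)
{
    long last = rows * cols - 1;
    return (k == last) ? last : (k * rows) % last;
}

/* Transposes the rows x cols row-major matrix a in place using O(1)
   extra storage, at the cost of re-walking cycles to find their leaders. */
void transpose_in_place(double *a, long rows, long cols)
{
    long n = rows * cols;

    /* Indices 0 and n - 1 never move, so only the ones between matter. */
    for (long start = 1; n - 1 > start; ++start) {
        /* Walk the cycle; stop on returning to start or reaching a smaller index. */
        long k = dest_index(start, rows, cols);
        while (k > start)
            k = dest_index(k, rows, cols);
        if (k != start)
            continue;   /* a smaller index leads this cycle: already done */

        /* start is the smallest index of its cycle: rotate the cycle once. */
        double carried = a[start];
        do {
            long d = dest_index(k, rows, cols);
            double displaced = a[d];
            a[d] = carried;
            carried = displaced;
            k = d;
        } while (k != start);
    }
}
```

The leader check keeps the extra storage constant, but it revisits elements many times and jumps around memory with a large stride, which is why this approach is called naive here.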

Here is the call: 

        mkl_dimatcopy('R', 'T', rows, cols, 1.0, matrix_ptr, rows, cols);

When I profiled the executable, most of the cycles were spent in 

        libmkl_avx.so      [.] mkl_trans_avx_mkl_dimatcopy_mipt_t

Am I doing something wrong, or is the algorithm simply not good on rectangular matrices (I'd be surprised)? Should I simply use an O(MN)-space algorithm? 

Thanks</description>
      <pubDate>Wed, 22 Apr 2015 06:28:36 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Slow-rectangular-matrix-transposition/m-p/1034954#M20379</guid>
      <dc:creator>Baptiste_W_</dc:creator>
      <dc:date>2015-04-22T06:28:36Z</dc:date>
    </item>
    <item>
      <title>Hi,</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Slow-rectangular-matrix-transposition/m-p/1034955#M20380</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;

&lt;P&gt;You are right: mkl_?imatcopy is not optimized for non-square cases, since even an optimized version would be much slower than out-of-place transposition. So in such situations we usually either use mkl_?omatcopy or, if it suits the algorithm, a gather-operation-scatter technique (e.g. copy a block of data to a temporary buffer, perform the needed operations, and scatter the data back to its place; this technique allows data to be reused in cache and generally improves performance).&lt;/P&gt;
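
&lt;P&gt;As a rough illustration of the out-of-place route, the sketch below transposes a row-major matrix into a separate buffer one tile at a time. It is a plain-C stand-in, not MKL code and not a description of how mkl_?omatcopy is implemented; the name transpose_out_of_place and the TILE value are assumptions made for the example.&lt;/P&gt;

```c
/* Hypothetical tile edge; 32 is an assumption, not a tuned value. */
enum { TILE = 32 };

/* Writes the transpose of the row-major rows x cols matrix src into dst.
   dst must hold cols x rows doubles and must not overlap src. */
void transpose_out_of_place(const double *src, double *dst, long rows, long cols)
{
    for (long i0 = 0; rows > i0; i0 += TILE) {
        long i1 = (i0 + TILE > rows) ? rows : i0 + TILE;
        for (long j0 = 0; cols > j0; j0 += TILE) {
            long j1 = (j0 + TILE > cols) ? cols : j0 + TILE;
            /* Handle one tile at a time so the rows read from src and
               the rows written to dst can both stay in cache. */
            for (long i = i0; i != i1; ++i)
                for (long j = j0; j != j1; ++j)
                    dst[j * rows + i] = src[i * cols + j];
        }
    }
}
```

&lt;P&gt;If the result has to end up in the original array, it can be copied back from dst afterwards, which costs O(MN) extra space. With MKL itself the corresponding call should be along the lines of mkl_domatcopy('R', 'T', rows, cols, 1.0, src, cols, dst, rows), assuming the row-major convention in which lda is the source row length and ldb the destination row length; please check this against the MKL reference for your version.&lt;/P&gt;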

&lt;P&gt;The square case is well optimized, since that is the case where mkl_?imatcopy can really help.&lt;/P&gt;</description>
      <pubDate>Wed, 22 Apr 2015 07:01:22 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Slow-rectangular-matrix-transposition/m-p/1034955#M20380</guid>
      <dc:creator>Evarist_F_Intel</dc:creator>
      <dc:date>2015-04-22T07:01:22Z</dc:date>
    </item>
    <item>
      <title>Hi,</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Slow-rectangular-matrix-transposition/m-p/1034956#M20381</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;

&lt;P&gt;Thanks for the quick answer :)&lt;/P&gt;

&lt;P&gt;I will switch to copy/omatcopy for now. imatcopy has really impressive performance on square matrices.&lt;/P&gt;

&lt;P&gt;Even if not fully optimized for the rectangular case, I would have expected better performance than my naive algorithm.&lt;/P&gt;</description>
      <pubDate>Wed, 22 Apr 2015 07:18:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Slow-rectangular-matrix-transposition/m-p/1034956#M20381</guid>
      <dc:creator>Baptiste_W_</dc:creator>
      <dc:date>2015-04-22T07:18:00Z</dc:date>
    </item>
  </channel>
</rss>

