<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Ioan,  Could you give us M x in Intel® oneAPI Math Kernel Library</title>
    <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Non-square-Matrix-Transpose/m-p/1051532#M21197</link>
    <description>&lt;P&gt;Ioan, &amp;nbsp;Could you give us M x N sizes instead of the # of elements?&amp;nbsp;&lt;/P&gt;</description>
    <pubDate>Fri, 26 Jun 2015 10:35:24 GMT</pubDate>
    <dc:creator>Gennady_F_Intel</dc:creator>
    <dc:date>2015-06-26T10:35:24Z</dc:date>
    <item>
      <title>Non-square Matrix Transpose</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Non-square-Matrix-Transpose/m-p/1051531#M21196</link>
      <description>&lt;P&gt;Hi guys,&lt;/P&gt;

&lt;P&gt;Are there any highly optimized MKL routines, or maybe performance primitives, that can do rectangular matrix transposition without scaling?&lt;/P&gt;

&lt;P&gt;I've been using mkl_omatcopy, but it seems to perform worse than a naive baseline implementation, and I suspect this is due to the additional scaling that is performed. I've attached a plot comparing a naive baseline implementation against omatcopy and imatcopy. The latter I know runs very poorly on non-square matrices.&lt;/P&gt;

&lt;P&gt;I just want to know whether I should start spending some time optimizing my own transpose routine with AVX/AVX2 and blocking or whether there's a very efficient one out there already.&amp;nbsp;&lt;/P&gt;

&lt;P&gt;Also, swapping indices is not viable for what I am trying to achieve.&amp;nbsp;&lt;/P&gt;

&lt;P&gt;Thank you in advance!&lt;/P&gt;

&lt;P&gt;Ioan&lt;/P&gt;

</description>
      <pubDate>Wed, 24 Jun 2015 10:40:44 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Non-square-Matrix-Transpose/m-p/1051531#M21196</guid>
      <dc:creator>Ioan_Hadade</dc:creator>
      <dc:date>2015-06-24T10:40:44Z</dc:date>
    </item>
    <item>
      <title>Ioan,  Could you give us M x</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Non-square-Matrix-Transpose/m-p/1051532#M21197</link>
      <description>&lt;P&gt;Ioan, &amp;nbsp;Could you give us M x N sizes instead of the # of elements?&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Fri, 26 Jun 2015 10:35:24 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Non-square-Matrix-Transpose/m-p/1051532#M21197</guid>
      <dc:creator>Gennady_F_Intel</dc:creator>
      <dc:date>2015-06-26T10:35:24Z</dc:date>
    </item>
    <item>
      <title>Hi Gennady,</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Non-square-Matrix-Transpose/m-p/1051533#M21198</link>
      <description>&lt;P style="font-size: 13.0080003738403px; line-height: 19.5120010375977px;"&gt;Hi Gennady,&lt;/P&gt;

&lt;P style="font-size: 13.0080003738403px; line-height: 19.5120010375977px;"&gt;Thanks for your reply. The transpositions I am performing are related to the dimension-lifted transposition (DLT) described in Henretty et al. (&lt;A href="http://repository.cmu.edu/cgi/viewcontent.cgi?article=1263&amp;amp;context=ece" style="color: rgb(17, 85, 204);" target="_blank"&gt;http://repository.cmu.edu/cgi/viewcontent.cgi?article=1263&amp;amp;context=ece&lt;/A&gt;). Basically, it reorganises the data layout so as to allow aligned vector loads and stores of stencils in the x-direction path.&lt;/P&gt;

&lt;P style="font-size: 13.0080003738403px; line-height: 19.5120010375977px;"&gt;Anyway, I am transposing these large vectors into M x N arrays, where N is always the SIMD register width, which in this case is 4 because I am working in double precision. Therefore, on the graph, every matrix size is M x N, where M = number of elements / veclen and N = veclen.&lt;/P&gt;

&lt;P style="font-size: 13.0080003738403px; line-height: 19.5120010375977px;"&gt;I guess this shape could be a cause of the poor performance, due to gathers and scatters? By the way, I am running this on a Xeon E5-2650 (Sandy Bridge).&lt;/P&gt;
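&lt;P style="font-size: 13.0080003738403px; line-height: 19.5120010375977px;"&gt;A minimal sketch of the layout change described above, under stated assumptions: the n values are viewed as a veclen x nv row-major matrix (nv = n / veclen) and transposed into an nv x veclen one. The names dlt_index and dlt_forward are hypothetical helpers introduced for illustration, not MKL routines, and the sketch assumes n is a multiple of veclen.&lt;/P&gt;

<![CDATA[
```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not an MKL routine): destination index of element i
 * when a vector of nv * veclen values is viewed as a veclen x nv row-major
 * matrix and transposed into an nv x veclen row-major matrix. */
size_t dlt_index(size_t i, size_t nv, size_t veclen)
{
    assert(i < nv * veclen);
    size_t row = i / nv;        /* row in the veclen x nv view    */
    size_t col = i % nv;        /* column in the veclen x nv view */
    return col * veclen + row;  /* position in the nv x veclen result */
}

/* Out-of-place layout change built on the index map above.
 * Assumes n is a multiple of veclen and that q and qt do not overlap. */
void dlt_forward(const double *q, double *qt, size_t n, size_t veclen)
{
    assert(veclen > 0 && n % veclen == 0);
    size_t nv = n / veclen;
    for (size_t i = 0; i < n; ++i)
        qt[dlt_index(i, nv, veclen)] = q[i];
}
```
]]>

&lt;P style="font-size: 13.0080003738403px; line-height: 19.5120010375977px;"&gt;With n = 12 and veclen = 4, for example, the element at index 1 moves to index 4 and the element at index 5 moves to index 9, so consecutive input elements end up veclen apart in the output.&lt;/P&gt;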

&lt;P style="font-size: 13.0080003738403px; line-height: 19.5120010375977px;"&gt;The code looks something like this:&lt;/P&gt;

&lt;PRE&gt;// out-of-place MKL transposition
mkl_domatcopy('r','t',VECLEN,NV,1,&amp;amp;q,NV,&amp;amp;qt,VECLEN);
mkl_domatcopy('r','t',VECLEN,NV,1,&amp;amp;aux,NV,&amp;amp;auxt,VECLEN);

roe_fluxes_xplane();

// retranspose data back into original format for y-sweep of flucrd
mkl_domatcopy('r','t',NV,VECLEN,1,&amp;amp;qt,VECLEN,&amp;amp;q,NV);
mkl_domatcopy('r','t',NV,VECLEN,1,&amp;amp;auxt,VECLEN,&amp;amp;aux,NV);&lt;/PRE&gt;

&lt;P style="font-size: 13.0080003738403px; line-height: 19.5120010375977px;"&gt;So basically I need to transpose the data into the DLT format and then back again. Originally, the matrices have a rectangular shape, as they represent distinct blocks from a multiblock grid.&lt;/P&gt;
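&lt;P style="font-size: 13.0080003738403px; line-height: 19.5120010375977px;"&gt;For checking results or timing against a baseline, the forward and backward transposes can be sketched without scaling as a plain loop. This is only an illustrative reference, assuming row-major storage and non-overlapping buffers; transpose_ref is a hypothetical name, not part of MKL, and its argument order merely mirrors the rows, cols, source, lda, destination, ldb order of the calls above.&lt;/P&gt;

<![CDATA[
```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reference routine (not part of MKL): row-major, out-of-place
 * transpose with no scaling. A is rows x cols with leading dimension lda;
 * B receives A^T, is cols x rows, and has leading dimension ldb. */
void transpose_ref(size_t rows, size_t cols,
                   const double *a, size_t lda,
                   double *b, size_t ldb)
{
    assert(lda >= cols && ldb >= rows);
    for (size_t i = 0; i < rows; ++i)
        for (size_t j = 0; j < cols; ++j)
            b[j * ldb + i] = a[i * lda + j];
}
```
]]>

&lt;P style="font-size: 13.0080003738403px; line-height: 19.5120010375977px;"&gt;Under these assumptions, transpose_ref(VECLEN, NV, q, NV, qt, VECLEN) followed by transpose_ref(NV, VECLEN, qt, VECLEN, q, NV) should restore the original contents of q, which gives a simple round-trip check.&lt;/P&gt;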

&lt;P style="font-size: 13.0080003738403px; line-height: 19.5120010375977px;"&gt;Thank you in advance for your kind consideration.&lt;/P&gt;</description>
      <pubDate>Fri, 26 Jun 2015 10:50:59 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Non-square-Matrix-Transpose/m-p/1051533#M21198</guid>
      <dc:creator>Ioan_Hadade</dc:creator>
      <dc:date>2015-06-26T10:50:59Z</dc:date>
    </item>
  </channel>
</rss>

