<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Matrix Transpose for char/short array in Intel® ISA Extensions</title>
    <link>https://community.intel.com/t5/Intel-ISA-Extensions/Matrix-Transpose-for-char-short-array/m-p/798620#M575</link>
    <description>&lt;DIV id="tiny_quote"&gt;&lt;DIV style="margin-left: 2px; margin-right: 2px;"&gt;Quoting &lt;A jquery1330825060453="58" rel="/en-us/services/profile/quick_profile.php?is_paid=&amp;amp;user_id=555188" href="https://community.intel.com/en-us/profile/555188/" class="basic"&gt;gautam.himanshu&lt;/A&gt;&lt;/DIV&gt;&lt;DIV style="background-color: #e5e5e5; margin-left: 2px; margin-right: 2px; border: 1px inset; padding: 5px;"&gt;&lt;I&gt;We can find the macro &lt;STRONG&gt;_MM_TRANSPOSE4_PS&lt;/STRONG&gt; for transpose of floats...&lt;/I&gt;&lt;/DIV&gt;&lt;BR /&gt;&lt;P&gt;Does it make sense to use an &lt;STRONG&gt;SSE&lt;/STRONG&gt; based transpose for a &lt;STRONG&gt;4x4&lt;/STRONG&gt; matrix instead of a &lt;STRONG&gt;Classic&lt;/STRONG&gt; algorithm?&lt;/P&gt;&lt;P&gt;Please take a look at the results of a test:&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;DEBUG configuration&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt; &amp;gt; Test1028 Start &amp;lt;&lt;BR /&gt; Sub-Test 5 - 200,000,000 calls to [ &lt;STRONG&gt;CLASSIC&lt;/STRONG&gt; 4x4 Matrix Transpose ] - &lt;STRONG&gt;19657&lt;/STRONG&gt; ticks&lt;BR /&gt; Sub-Test 6 - 200,000,000 calls to [ &lt;STRONG&gt;SSE&lt;/STRONG&gt; 4x4 Matrix Transpose  ] - &lt;STRONG&gt;8640&lt;/STRONG&gt; ticks // &lt;STRONG&gt;2.28&lt;/STRONG&gt;x faster&lt;BR /&gt; &amp;gt; Test1028 End &amp;lt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;RELEASE configuration&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt; &amp;gt; Test1028 Start &amp;lt;&lt;BR /&gt; Sub-Test 5 - 200,000,000 calls to [ &lt;STRONG&gt;CLASSIC&lt;/STRONG&gt; 4x4 Matrix Transpose ] - &lt;STRONG&gt;18563&lt;/STRONG&gt; ticks&lt;BR /&gt; Sub-Test 6 - 200,000,000 calls to [ &lt;STRONG&gt;SSE&lt;/STRONG&gt; 4x4 Matrix Transpose  ] - &lt;STRONG&gt;5843&lt;/STRONG&gt; ticks // &lt;STRONG&gt;3.18&lt;/STRONG&gt;x faster&lt;BR /&gt; &amp;gt; Test1028 End &amp;lt;&lt;/P&gt;&lt;/DIV&gt;</description>
    <pubDate>Sun, 04 Mar 2012 01:42:47 GMT</pubDate>
    <dc:creator>SergeyKostrov</dc:creator>
    <dc:date>2012-03-04T01:42:47Z</dc:date>
    <item>
      <title>Matrix Transpose for char/short array</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Matrix-Transpose-for-char-short-array/m-p/798616#M571</link>
      <description>The macro _MM_TRANSPOSE4_PS is available for transposing floats, but I am interested in transposing an array of characters. I was able to write an equivalent of _MM_TRANSPOSE4_PS myself using unpack and move intrinsics, but I can't find similar intrinsics for chars. &lt;BR /&gt;&lt;BR /&gt;Can anyone please advise what approach should be taken in this situation?&lt;BR /&gt;&lt;BR /&gt;Thanks&lt;BR /&gt;HG&lt;BR /&gt;</description>
      <pubDate>Fri, 02 Mar 2012 13:05:21 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Matrix-Transpose-for-char-short-array/m-p/798616#M571</guid>
      <dc:creator>gautam_himanshu</dc:creator>
      <dc:date>2012-03-02T13:05:21Z</dc:date>
    </item>
    <item>
      <title>Matrix Transpose for char/short array</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Matrix-Transpose-for-char-short-array/m-p/798617#M572</link>
      <description>Depending on the array size, PSHUFB can be a solution, or at least be helpful.&lt;BR /&gt;Please give an example.</description>
      <pubDate>Fri, 02 Mar 2012 14:52:57 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Matrix-Transpose-for-char-short-array/m-p/798617#M572</guid>
      <dc:creator>sirrida</dc:creator>
      <dc:date>2012-03-02T14:52:57Z</dc:date>
    </item>
    <item>
      <title>Matrix Transpose for char/short array</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Matrix-Transpose-for-char-short-array/m-p/798618#M573</link>
      <description>Show us your declaration for the character array.&lt;BR /&gt;This will give us an idea of the size of the array and the number of dimensions.&lt;BR /&gt;&lt;BR /&gt;Jim Dempsey&lt;BR /&gt;&lt;BR /&gt;</description>
      <pubDate>Fri, 02 Mar 2012 21:49:15 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Matrix-Transpose-for-char-short-array/m-p/798618#M573</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2012-03-02T21:49:15Z</dc:date>
    </item>
    <item>
      <title>Matrix Transpose for char/short array</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Matrix-Transpose-for-char-short-array/m-p/798619#M574</link>
      <description>The '&lt;STRONG&gt;_MM_TRANSPOSE4_PS&lt;/STRONG&gt;' macro declared in '&lt;STRONG&gt;xmmintrin.h&lt;/STRONG&gt;' is designed to take a &lt;STRONG&gt;4x4&lt;/STRONG&gt; matrix of &lt;STRONG&gt;&lt;SPAN style="text-decoration: underline;"&gt;floats&lt;/SPAN&gt;&lt;/STRONG&gt; ( single-precision ) as input.&lt;BR /&gt;&lt;BR /&gt;With a similar approach, a 128-bit XMM register holds 16 chars or 8 shorts, so the biggest dimension for a '&lt;STRONG&gt;char&lt;/STRONG&gt;' type will be &lt;STRONG&gt;16x16&lt;/STRONG&gt;, and for a '&lt;STRONG&gt;short&lt;/STRONG&gt;' type it will be &lt;STRONG&gt;8x8&lt;/STRONG&gt;.&lt;BR /&gt;&lt;BR /&gt;I'd like to repeat the same question:&lt;BR /&gt;&lt;BR /&gt; &lt;STRONG&gt;How big are your 'char' / 'short' matrices?&lt;/STRONG&gt;&lt;BR /&gt;&lt;BR /&gt;Best regards,&lt;BR /&gt;Sergey&lt;BR /&gt;</description>
      <pubDate>Sat, 03 Mar 2012 02:24:57 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Matrix-Transpose-for-char-short-array/m-p/798619#M574</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2012-03-03T02:24:57Z</dc:date>
    </item>
    <item>
      <title>Matrix Transpose for char/short array</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Matrix-Transpose-for-char-short-array/m-p/798620#M575</link>
      <description>&lt;DIV id="tiny_quote"&gt;&lt;DIV style="margin-left: 2px; margin-right: 2px;"&gt;Quoting &lt;A jquery1330825060453="58" rel="/en-us/services/profile/quick_profile.php?is_paid=&amp;amp;user_id=555188" href="https://community.intel.com/en-us/profile/555188/" class="basic"&gt;gautam.himanshu&lt;/A&gt;&lt;/DIV&gt;&lt;DIV style="background-color: #e5e5e5; margin-left: 2px; margin-right: 2px; border: 1px inset; padding: 5px;"&gt;&lt;I&gt;We can find the macro &lt;STRONG&gt;_MM_TRANSPOSE4_PS&lt;/STRONG&gt; for transpose of floats...&lt;/I&gt;&lt;/DIV&gt;&lt;BR /&gt;&lt;P&gt;Does it make sense to use an &lt;STRONG&gt;SSE&lt;/STRONG&gt; based transpose for a &lt;STRONG&gt;4x4&lt;/STRONG&gt; matrix instead of a &lt;STRONG&gt;Classic&lt;/STRONG&gt; algorithm?&lt;/P&gt;&lt;P&gt;Please take a look at the results of a test:&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;DEBUG configuration&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt; &amp;gt; Test1028 Start &amp;lt;&lt;BR /&gt; Sub-Test 5 - 200,000,000 calls to [ &lt;STRONG&gt;CLASSIC&lt;/STRONG&gt; 4x4 Matrix Transpose ] - &lt;STRONG&gt;19657&lt;/STRONG&gt; ticks&lt;BR /&gt; Sub-Test 6 - 200,000,000 calls to [ &lt;STRONG&gt;SSE&lt;/STRONG&gt; 4x4 Matrix Transpose  ] - &lt;STRONG&gt;8640&lt;/STRONG&gt; ticks // &lt;STRONG&gt;2.28&lt;/STRONG&gt;x faster&lt;BR /&gt; &amp;gt; Test1028 End &amp;lt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;RELEASE configuration&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt; &amp;gt; Test1028 Start &amp;lt;&lt;BR /&gt; Sub-Test 5 - 200,000,000 calls to [ &lt;STRONG&gt;CLASSIC&lt;/STRONG&gt; 4x4 Matrix Transpose ] - &lt;STRONG&gt;18563&lt;/STRONG&gt; ticks&lt;BR /&gt; Sub-Test 6 - 200,000,000 calls to [ &lt;STRONG&gt;SSE&lt;/STRONG&gt; 4x4 Matrix Transpose  ] - &lt;STRONG&gt;5843&lt;/STRONG&gt; ticks // &lt;STRONG&gt;3.18&lt;/STRONG&gt;x faster&lt;BR /&gt; &amp;gt; Test1028 End &amp;lt;&lt;/P&gt;&lt;/DIV&gt;</description>
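      <content:encoded><![CDATA[
The test code behind the tick counts above was not posted, so the following is only a minimal sketch of the two variants being compared, assuming a row-major array of 16 floats. The function names `transpose4x4_classic` and `transpose4x4_sse` are hypothetical; only `_MM_TRANSPOSE4_PS`, `_mm_loadu_ps` and `_mm_storeu_ps` come from `xmmintrin.h`.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE: __m128, _mm_loadu_ps, _mm_storeu_ps, _MM_TRANSPOSE4_PS */

/* Classic variant: two nested loops and a separate output matrix.
   Both matrices are row-major arrays of 16 floats. */
void transpose4x4_classic(const float *in, float *out)
{
    assert(in != NULL && out != NULL);
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            out[j * 4 + i] = in[i * 4 + j];
}

/* SSE variant: load four rows, transpose them in registers, store in place.
   Unaligned loads/stores are used so the sketch works for any float buffer. */
void transpose4x4_sse(float *m)
{
    assert(m != NULL);
    __m128 r0 = _mm_loadu_ps(m + 0);
    __m128 r1 = _mm_loadu_ps(m + 4);
    __m128 r2 = _mm_loadu_ps(m + 8);
    __m128 r3 = _mm_loadu_ps(m + 12);
    _MM_TRANSPOSE4_PS(r0, r1, r2, r3);
    _mm_storeu_ps(m + 0, r0);
    _mm_storeu_ps(m + 4, r1);
    _mm_storeu_ps(m + 8, r2);
    _mm_storeu_ps(m + 12, r3);
}
```

Timing either function in a loop of 200,000,000 calls, as in the test above, would need care that the compiler does not optimize the loop away in a release build.
      ]]></content:encoded>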
      <pubDate>Sun, 04 Mar 2012 01:42:47 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Matrix-Transpose-for-char-short-array/m-p/798620#M575</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2012-03-04T01:42:47Z</dc:date>
    </item>
    <item>
      <title>Matrix Transpose for char/short array</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Matrix-Transpose-for-char-short-array/m-p/798621#M576</link>
      <description>I am working on some benchmarks and generally use sizes like 1k x 1k. Shuffling the xmm registers seems to be the only possible way, and I don't think it will give good gains.</description>
      <pubDate>Mon, 05 Mar 2012 04:03:53 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Matrix-Transpose-for-char-short-array/m-p/798621#M576</guid>
      <dc:creator>gautam_himanshu</dc:creator>
      <dc:date>2012-03-05T04:03:53Z</dc:date>
    </item>
    <item>
      <title>Matrix Transpose for char/short array</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Matrix-Transpose-for-char-short-array/m-p/798622#M577</link>
      <description>How about using IPP? &lt;A href="http://software.intel.com/sites/products/documentation/hpc/ipp/ippi/ippi_ch4/functn_Transpose.html" target="_blank"&gt;http://software.intel.com/sites/products/documentation/hpc/ipp/ippi/ippi_ch4/functn_Transpose.html&lt;/A&gt;</description>
      <pubDate>Mon, 05 Mar 2012 06:22:10 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Matrix-Transpose-for-char-short-array/m-p/798622#M577</guid>
      <dc:creator>styc</dc:creator>
      <dc:date>2012-03-05T06:22:10Z</dc:date>
    </item>
    <item>
      <title>Matrix Transpose for char/short array</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Matrix-Transpose-for-char-short-array/m-p/798623#M578</link>
      <description>&lt;DIV id="tiny_quote"&gt;&lt;DIV style="margin-left: 2px; margin-right: 2px;"&gt;Quoting &lt;A jquery1330957388000="58" rel="/en-us/services/profile/quick_profile.php?is_paid=&amp;amp;user_id=555188" href="https://community.intel.com/en-us/profile/555188/" class="basic"&gt;gautam.himanshu&lt;/A&gt;&lt;/DIV&gt;&lt;DIV style="background-color: #e5e5e5; margin-left: 2px; margin-right: 2px; border: 1px inset; padding: 5px;"&gt;&lt;I&gt;I am working on some benchmarks and generally use sizes like &lt;STRONG&gt;1k x 1k&lt;/STRONG&gt;. Shuffling the xmm registers seems to be the only possible way, and I don't think it will give good gains. &lt;/I&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;P&gt;&lt;BR /&gt;I could provide you with the performance numbers for two &lt;STRONG&gt;Matrix Transpose&lt;/STRONG&gt; algorithms, applied to a &lt;STRONG&gt;1K x 1K&lt;/STRONG&gt; matrix, that I've implemented for my current project. That is,&lt;BR /&gt;&lt;BR /&gt; - a &lt;STRONG&gt;Classic&lt;/STRONG&gt; ( Two-For-Loops / Non-Inplace )&lt;BR /&gt;&lt;BR /&gt;and&lt;BR /&gt;&lt;BR /&gt; - a &lt;STRONG&gt;Diagonal Based&lt;/STRONG&gt; ( Two-For-Loops / Inplace )&lt;BR /&gt;&lt;BR /&gt;The &lt;STRONG&gt;Diagonal Based&lt;/STRONG&gt; algorithm doesn't need a second output matrix and has a reduced number of exchanges. It never "touches" values along the diagonal line from the top-left corner to the bottom-right corner of the matrix.&lt;/P&gt;</description>
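      <content:encoded><![CDATA[
The implementation itself was not posted, so this is only a minimal sketch of a diagonal-based in-place transpose as described above, assuming a square row-major matrix of `unsigned char`. The name `transpose_diagonal_u8` and the returned exchange counter are hypothetical additions for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* In-place transpose of an n x n row-major matrix of bytes.
   Only elements strictly above the diagonal are visited; each one is
   swapped with its mirror element below the diagonal, so diagonal
   values are never touched. Returns the number of exchanges, which
   is n * (n - 1) / 2. */
size_t transpose_diagonal_u8(unsigned char *m, size_t n)
{
    assert(m != NULL);
    size_t exchanges = 0;
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = i + 1; j < n; ++j) {
            unsigned char tmp = m[i * n + j];
            m[i * n + j] = m[j * n + i];
            m[j * n + i] = tmp;
            ++exchanges;
        }
    }
    return exchanges;
}
```

For a 1K x 1K matrix the inner loop walks one operand along a row and the other down a column, so cache behaviour, not the exchange count alone, is likely to dominate the run time.
      ]]></content:encoded>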
      <pubDate>Mon, 05 Mar 2012 14:35:24 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Matrix-Transpose-for-char-short-array/m-p/798623#M578</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2012-03-05T14:35:24Z</dc:date>
    </item>
    <item>
      <title>Matrix Transpose for char/short array</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Matrix-Transpose-for-char-short-array/m-p/798624#M579</link>
      <description>&lt;P&gt;Please take a look at the performance results.&lt;/P&gt;&lt;P&gt;Matrix size: &lt;STRONG&gt;1,024 x 1,024&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Classic&lt;/STRONG&gt; Transpose  - ( 128 transposes in 10.015 sec ) = 0.0782421875 sec per transpose&lt;BR /&gt;&lt;STRONG&gt;Diagonal&lt;/STRONG&gt; Transpose - ( 128 transposes in 5.609 sec ) = 0.0438203125 sec per transpose =&amp;gt; ~&lt;STRONG&gt;1.79x&lt;/STRONG&gt; faster&lt;/P&gt;</description>
      <pubDate>Mon, 05 Mar 2012 21:04:17 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Matrix-Transpose-for-char-short-array/m-p/798624#M579</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2012-03-05T21:04:17Z</dc:date>
    </item>
    <item>
      <title>Matrix Transpose for char/short array</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Matrix-Transpose-for-char-short-array/m-p/798625#M580</link>
      <description>&lt;P&gt;Please take a look at the results of another test.&lt;/P&gt;&lt;P&gt;If four &lt;STRONG&gt;__m128&lt;/STRONG&gt; variables:&lt;/P&gt;&lt;P&gt; ...&lt;BR /&gt; __m128 row1 = { 0x0 };&lt;BR /&gt; __m128 row2 = { 0x0 };&lt;BR /&gt; __m128 row3 = { 0x0 };&lt;BR /&gt; __m128 row4 = { 0x0 };&lt;BR /&gt; ...&lt;/P&gt;&lt;P&gt;are initialized with &lt;SPAN style="text-decoration: underline;"&gt;characters&lt;/SPAN&gt; as follows:&lt;/P&gt;&lt;P&gt; ...&lt;BR /&gt; row1.m128_u8[ 0] = '0'; row1.m128_u8[ 1] = '1'; row1.m128_u8[ 2] = '2'; row1.m128_u8[ 3] = '3';&lt;BR /&gt; row1.m128_u8[ 4] = '4'; row1.m128_u8[ 5] = '5'; row1.m128_u8[ 6] = '6'; row1.m128_u8[ 7] = '7';&lt;BR /&gt; row1.m128_u8[ 8] = '8'; row1.m128_u8[ 9] = '9'; row1.m128_u8[10] = 'A'; row1.m128_u8[11] = 'B';&lt;BR /&gt; row1.m128_u8[12] = 'C'; row1.m128_u8[13] = 'D'; row1.m128_u8[14] = 'E'; row1.m128_u8[15] = 'F';&lt;BR /&gt; ...&lt;BR /&gt; &amp;lt; the same for rows row2, row3 and row4 &amp;gt;&lt;BR /&gt; ...&lt;/P&gt;&lt;P&gt;a &lt;STRONG&gt;Source Matrix&lt;/STRONG&gt; ( as characters ) will look like:&lt;/P&gt;&lt;P&gt; 0123456789ABCDEF&lt;BR /&gt; 0123456789ABCDEF&lt;BR /&gt; 0123456789ABCDEF&lt;BR /&gt; 0123456789ABCDEF&lt;/P&gt;&lt;P&gt;and after a call to:&lt;/P&gt;&lt;P&gt; ...&lt;BR /&gt; &lt;STRONG&gt;_MM_TRANSPOSE4_PS&lt;/STRONG&gt;( row1, row2, row3, row4 );&lt;BR /&gt; ...&lt;/P&gt;&lt;P&gt;a &lt;STRONG&gt;Transposed Matrix&lt;/STRONG&gt; will look like:&lt;/P&gt;&lt;P&gt; 0123012301230123&lt;BR /&gt; 4567456745674567&lt;BR /&gt; 89AB89AB89AB89AB&lt;BR /&gt; CDEFCDEFCDEFCDEF&lt;/P&gt;&lt;P&gt;This is wrong as a character transpose, but &lt;SPAN style="text-decoration: underline;"&gt;there is nothing unusual&lt;/SPAN&gt; here. The &lt;STRONG&gt;_MM_TRANSPOSE4_PS&lt;/STRONG&gt; macro cannot be used for transposing a &lt;STRONG&gt;4x16&lt;/STRONG&gt; matrix of characters because it was designed to transpose a &lt;STRONG&gt;4x4&lt;/STRONG&gt; matrix of floats, so it moves the characters in groups of four bytes.&lt;/P&gt;</description>
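      <content:encoded><![CDATA[
The experiment above can be reproduced without the MSVC-specific `m128_u8` member. The following is a sketch under assumptions: the helper name `transpose_chars_as_floats` is hypothetical, and the rows are passed as a flat buffer of 4 x 16 bytes rather than as four `__m128` variables. The casts only reinterpret bits, so each group of four characters travels as one float.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* _MM_TRANSPOSE4_PS */
#include <emmintrin.h>  /* SSE2: _mm_loadu_si128, _mm_storeu_si128, casts */

/* Treats four 16-byte rows of characters as a 4x4 matrix of floats and
   transposes it. in and out each hold 4 * 16 bytes. The result is a
   transpose of 4-byte groups, not of individual characters. */
void transpose_chars_as_floats(const char *in, char *out)
{
    assert(in != NULL && out != NULL);
    __m128 r0 = _mm_castsi128_ps(_mm_loadu_si128((const __m128i *)(in + 0)));
    __m128 r1 = _mm_castsi128_ps(_mm_loadu_si128((const __m128i *)(in + 16)));
    __m128 r2 = _mm_castsi128_ps(_mm_loadu_si128((const __m128i *)(in + 32)));
    __m128 r3 = _mm_castsi128_ps(_mm_loadu_si128((const __m128i *)(in + 48)));
    _MM_TRANSPOSE4_PS(r0, r1, r2, r3);
    _mm_storeu_si128((__m128i *)(out + 0),  _mm_castps_si128(r0));
    _mm_storeu_si128((__m128i *)(out + 16), _mm_castps_si128(r1));
    _mm_storeu_si128((__m128i *)(out + 32), _mm_castps_si128(r2));
    _mm_storeu_si128((__m128i *)(out + 48), _mm_castps_si128(r3));
}
```

With every input row set to `0123456789ABCDEF`, the first output row is `0123012301230123`, matching the result reported above.
      ]]></content:encoded>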
      <pubDate>Mon, 05 Mar 2012 21:09:50 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Matrix-Transpose-for-char-short-array/m-p/798625#M580</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2012-03-05T21:09:50Z</dc:date>
    </item>
    <item>
      <title>Matrix Transpose for char/short array</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Matrix-Transpose-for-char-short-array/m-p/798626#M581</link>
      <description>&lt;DIV id="tiny_quote"&gt;&lt;DIV style="margin-left: 2px; margin-right: 2px;"&gt;Quoting &lt;A jquery1331045908781="58" rel="/en-us/services/profile/quick_profile.php?is_paid=&amp;amp;user_id=555188" href="https://community.intel.com/en-us/profile/555188/" class="basic"&gt;gautam.himanshu&lt;/A&gt;&lt;/DIV&gt;&lt;DIV style="background-color: #e5e5e5; margin-left: 2px; margin-right: 2px; border: 1px inset; padding: 5px;"&gt;&lt;I&gt;I am working on some benchmarks and generally use sizes like 1k x 1k. &lt;STRONG&gt;Shuffling the xmm registers seems to be the only possible way, and I don't think it will give good gains&lt;/STRONG&gt;. &lt;/I&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;P&gt;&lt;BR /&gt;It would be interesting to see the results of your R&amp;amp;D. Please provide some technical details and performance numbers if you can.&lt;BR /&gt;&lt;BR /&gt;Did you consider the &lt;STRONG&gt;Eklundh&lt;/STRONG&gt; method of Matrix Transpose?&lt;BR /&gt;&lt;BR /&gt;Best regards,&lt;BR /&gt;Sergey&lt;/P&gt;</description>
      <pubDate>Tue, 06 Mar 2012 15:04:19 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Matrix-Transpose-for-char-short-array/m-p/798626#M581</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2012-03-06T15:04:19Z</dc:date>
    </item>
    <item>
      <title>Matrix Transpose for char/short array</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Matrix-Transpose-for-char-short-array/m-p/798627#M582</link>
      <description>Thanks for your interest. I finally managed to do a good transpose using the unpack epi8/16/32/64 instructions. It's hard to give any numbers, as the transpose was a part of the actual problem. Anyway, I am interested in the Eklundh method. What kind of numbers are reachable there?</description>
      <pubDate>Fri, 09 Mar 2012 13:45:22 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Matrix-Transpose-for-char-short-array/m-p/798627#M582</guid>
      <dc:creator>gautam_himanshu</dc:creator>
      <dc:date>2012-03-09T13:45:22Z</dc:date>
    </item>
    <item>
      <title>Matrix Transpose for char/short array</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Matrix-Transpose-for-char-short-array/m-p/798628#M583</link>
      <description>&lt;DIV id="tiny_quote"&gt;&lt;DIV style="margin-left: 2px; margin-right: 2px;"&gt;Quoting &lt;A jquery1331304184625="58" rel="/en-us/services/profile/quick_profile.php?is_paid=&amp;amp;user_id=555188" href="https://community.intel.com/en-us/profile/555188/" class="basic"&gt;gautam.himanshu&lt;/A&gt;&lt;/DIV&gt;&lt;DIV style="background-color: #e5e5e5; margin-left: 2px; margin-right: 2px; border: 1px inset; padding: 5px;"&gt;&lt;I&gt;Thanks for your interest. I finally managed to do a good transpose using the unpack epi8/16/32/64 instructions. It's hard to give any numbers, as the transpose was a part of the actual problem...&lt;/I&gt;&lt;/DIV&gt;&lt;BR /&gt;It would be nice to see a performance comparison of your &lt;STRONG&gt;SSE&lt;/STRONG&gt; based algorithm with a &lt;STRONG&gt;Classic&lt;/STRONG&gt; algorithm.&lt;BR /&gt;&lt;BR /&gt;The &lt;STRONG&gt;Eklundh&lt;/STRONG&gt; method for a matrix transpose makes more iterations and more exchanges compared to a &lt;STRONG&gt;Diagonal&lt;/STRONG&gt; based algorithm.&lt;/DIV&gt;</description>
      <pubDate>Fri, 09 Mar 2012 14:49:54 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Matrix-Transpose-for-char-short-array/m-p/798628#M583</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2012-03-09T14:49:54Z</dc:date>
    </item>
    <item>
      <title>Matrix Transpose for char/short array</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Matrix-Transpose-for-char-short-array/m-p/798629#M584</link>
      <description>&lt;DIV id="tiny_quote"&gt;&lt;DIV style="margin-left: 2px; margin-right: 2px;"&gt;Quoting &lt;A jquery1331490423218="58" rel="/en-us/services/profile/quick_profile.php?is_paid=&amp;amp;user_id=555188" href="https://community.intel.com/en-us/profile/555188/" class="basic"&gt;gautam.himanshu&lt;/A&gt;&lt;/DIV&gt;&lt;DIV style="background-color: #e5e5e5; margin-left: 2px; margin-right: 2px; border: 1px inset; padding: 5px;"&gt;&lt;I&gt;...Anyway, I am interested in the Eklundh method. &lt;SPAN style="text-decoration: underline;"&gt;What kind of numbers are reachable there?&lt;/SPAN&gt;...&lt;/I&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;P&gt;&lt;BR /&gt;Here is a comparison of the number of exchanges for different algorithms. In case of an &lt;STRONG&gt;8x8&lt;/STRONG&gt; matrix:&lt;BR /&gt;&lt;BR /&gt; &lt;STRONG&gt;Classic&lt;/STRONG&gt; - 64 exchanges&lt;BR /&gt; &lt;STRONG&gt;Diagonal&lt;/STRONG&gt; - 28 exchanges&lt;BR /&gt; &lt;STRONG&gt;Eklundh&lt;/STRONG&gt; - 48 exchanges&lt;BR /&gt;&lt;BR /&gt;Take into account that for the &lt;STRONG&gt;Diagonal&lt;/STRONG&gt; and &lt;STRONG&gt;Eklundh&lt;/STRONG&gt; algorithms an input matrix must be square, and both algorithms are in-place ( don't need an output matrix ).&lt;/P&gt;</description>
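      <content:encoded><![CDATA[
No Eklundh code was posted in the thread, so the following is only a sketch of one common in-memory formulation, assuming a square row-major byte matrix whose size is a power of two. For each index bit `s`, every element whose row index has that bit clear and whose column index has it set is swapped with the element where the two bits are exchanged. The name `transpose_eklundh_u8` and the exchange counter are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* In-place Eklundh-style transpose of an n x n row-major byte matrix,
   where n is a power of two. There are log2(n) stages, one per index
   bit s, and each stage performs n * n / 4 exchanges.
   Returns the total number of exchanges. */
size_t transpose_eklundh_u8(unsigned char *m, size_t n)
{
    assert(m != NULL);
    assert(n != 0 && (n & (n - 1)) == 0);  /* n must be a power of two */
    size_t exchanges = 0;
    for (size_t s = 1; s < n; s <<= 1) {
        for (size_t i = 0; i < n; ++i) {
            if (i & s)
                continue;              /* row bit must be clear */
            for (size_t j = 0; j < n; ++j) {
                if (!(j & s))
                    continue;          /* column bit must be set */
                size_t a = i * n + j;
                size_t b = (i + s) * n + (j - s);
                unsigned char tmp = m[a];
                m[a] = m[b];
                m[b] = tmp;
                ++exchanges;
            }
        }
    }
    return exchanges;
}
```

For an 8x8 matrix this gives 3 stages of 16 exchanges each, which is consistent with the count of 48 listed above.
      ]]></content:encoded>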
      <pubDate>Sun, 11 Mar 2012 19:00:53 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Matrix-Transpose-for-char-short-array/m-p/798629#M584</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2012-03-11T19:00:53Z</dc:date>
    </item>
    <item>
      <title>Matrix Transpose for char/short array</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Matrix-Transpose-for-char-short-array/m-p/798630#M585</link>
      <description>&lt;DIV&gt;pshufb will only work for transposing 4x4 byte matrices.&lt;/DIV&gt;&lt;DIV&gt;&lt;/DIV&gt;I've attached assembly code for x64 that will transpose 16x16 byte data very rapidly.&lt;DIV&gt;It is based on using punpcklbw and punpckhbw to interleave data between 16x1 columns stored in xmm registers.&lt;/DIV&gt;&lt;DIV&gt;&lt;/DIV&gt;&lt;DIV&gt;To link it to your C++ project you will need to assemble the file (I use the JWasm assembler), link the obj file to your project, then insert these lines to declare it as a function.&lt;/DIV&gt;&lt;DIV&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;DIV id="_mcePaste"&gt;#ifdef __cplusplus&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;extern "C" {&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;#endif&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;  void Transpose16x16A(char* a, char* b);&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;#ifdef __cplusplus&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;}&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;#endif&lt;/DIV&gt;&lt;DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;DIV&gt;The basic idea is to interleave data between columns. I will use a 4x4 matrix as an example, with each column of the matrix held in one register:&lt;/DIV&gt;&lt;DIV&gt;xmm0   xmm1   xmm2   xmm3&lt;/DIV&gt;&lt;DIV&gt;a0     b0     c0     d0&lt;/DIV&gt;&lt;DIV&gt;a1     b1     c1     d1&lt;/DIV&gt;&lt;DIV&gt;a2     b2     c2     d2&lt;/DIV&gt;&lt;DIV&gt;a3     b3     c3     d3&lt;/DIV&gt;&lt;DIV&gt;&lt;/DIV&gt;&lt;DIV&gt;The MERGE macro in my code is defined like this:&lt;/DIV&gt;&lt;DIV&gt;&lt;DIV&gt;MERGE  MACRO FIRST, SECOND, TEMP&lt;/DIV&gt;&lt;DIV&gt;    movdqa TEMP, FIRST&lt;/DIV&gt;&lt;DIV&gt;    punpcklbw FIRST, SECOND&lt;/DIV&gt;&lt;DIV&gt;    punpckhbw TEMP, SECOND&lt;/DIV&gt;&lt;DIV&gt;    movdqa SECOND, TEMP&lt;/DIV&gt;&lt;DIV&gt;ENDM&lt;/DIV&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;/DIV&gt;&lt;DIV&gt;It interleaves data from 2 columns of the matrix: FIRST receives the interleaved low halves and SECOND the interleaved high halves. In our example, if we apply&lt;/DIV&gt;&lt;DIV&gt;MERGE xmm0, xmm2, xmm4&lt;/DIV&gt;&lt;DIV&gt;MERGE xmm1, xmm3, xmm4&lt;/DIV&gt;&lt;DIV&gt;we get:&lt;/DIV&gt;&lt;DIV&gt;&lt;/DIV&gt;&lt;DIV&gt;xmm0   xmm1   xmm2   xmm3&lt;/DIV&gt;&lt;DIV&gt;a0     b0     a2     b2&lt;/DIV&gt;&lt;DIV&gt;c0     d0     c2     d2&lt;/DIV&gt;&lt;DIV&gt;a1     b1     a3     b3&lt;/DIV&gt;&lt;DIV&gt;c1     d1     c3     d3&lt;/DIV&gt;&lt;DIV&gt;&lt;/DIV&gt;&lt;DIV&gt;The data has been moved across columns and we are just one step away from the transpose.&lt;/DIV&gt;&lt;DIV&gt;Applying&lt;/DIV&gt;&lt;DIV&gt;MERGE xmm0, xmm1, xmm4&lt;/DIV&gt;&lt;DIV&gt;MERGE xmm2, xmm3, xmm4&lt;/DIV&gt;&lt;DIV&gt;we get:&lt;/DIV&gt;&lt;DIV&gt;&lt;/DIV&gt;&lt;DIV&gt;xmm0   xmm1   xmm2   xmm3&lt;/DIV&gt;&lt;DIV&gt;a0     a1     a2     a3&lt;/DIV&gt;&lt;DIV&gt;b0     b1     b2     b3&lt;/DIV&gt;&lt;DIV&gt;c0     c1     c2     c3&lt;/DIV&gt;&lt;DIV&gt;d0     d1     d2     d3&lt;/DIV&gt;&lt;DIV&gt;&lt;/DIV&gt;&lt;DIV&gt;The 16x16 transpose operates in the same way but requires more MERGE operations.&lt;/DIV&gt;&lt;DIV&gt;&lt;/DIV&gt;&lt;DIV&gt;One thing to note is that my code DOES NOT PRESERVE ANY XMM REGISTERS. The reason for this is that it is slow to save the registers and my code didn't need to. If you need to preserve registers, my suggestion would be to make a function that saves the registers, runs the 16x16 transpose in an assembly loop to generate a bunch of 16x16 transposes, and then restores the registers.&lt;/DIV&gt;&lt;DIV&gt;&lt;/DIV&gt;&lt;DIV&gt;When I timed this algorithm (Intel core i2 processor) on byte data, I achieved speeds of, on average, 1/4 clock cycle per byte. That means the full 16x16 transpose (256 bytes) should take roughly 64 clock cycles. I found this to be about 6x faster than the regular scalar algorithm (doubly nested loop with loop unrolling).&lt;/DIV&gt;&lt;DIV&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;/DIV&gt;</description>
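      <content:encoded><![CDATA[
The attached assembly is not reproduced in this feed, so the following is not that code. It is a minimal sketch of the same interleaving idea written with SSE2 intrinsics, where `_mm_unpacklo_epi8` and `_mm_unpackhi_epi8` correspond to punpcklbw and punpckhbw. The function name `transpose16x16_u8`, the flat row-major layout, and the register renumbering between stages are assumptions made for illustration. Each of the four stages merges register k with register k + 8; after four stages every byte at (row, column) has moved to (column, row).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>  /* SSE2: __m128i, _mm_unpacklo_epi8, _mm_unpackhi_epi8 */

/* Transposes a 16x16 row-major byte matrix from src into dst.
   All rows are loaded before anything is stored, so src may equal dst. */
void transpose16x16_u8(const uint8_t *src, uint8_t *dst)
{
    assert(src != NULL && dst != NULL);
    __m128i r[16], t[16];

    for (int i = 0; i < 16; ++i)
        r[i] = _mm_loadu_si128((const __m128i *)(src + 16 * i));

    /* Four merge stages: log2(16) rounds of byte interleaving. */
    for (int stage = 0; stage < 4; ++stage) {
        for (int k = 0; k < 8; ++k) {
            t[2 * k]     = _mm_unpacklo_epi8(r[k], r[k + 8]);
            t[2 * k + 1] = _mm_unpackhi_epi8(r[k], r[k + 8]);
        }
        for (int i = 0; i < 16; ++i)
            r[i] = t[i];
    }

    for (int i = 0; i < 16; ++i)
        _mm_storeu_si128((__m128i *)(dst + 16 * i), r[i]);
}
```

A compiler has to spill some of the 32 temporaries on a 32-bit target with only 8 XMM registers, so the throughput of this sketch may differ noticeably from hand-written x64 assembly.
      ]]></content:encoded>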
      <pubDate>Tue, 13 Mar 2012 15:39:28 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Matrix-Transpose-for-char-short-array/m-p/798630#M585</guid>
      <dc:creator>nick_1234</dc:creator>
      <dc:date>2012-03-13T15:39:28Z</dc:date>
    </item>
    <item>
      <title>Matrix Transpose for char/short array</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Matrix-Transpose-for-char-short-array/m-p/798631#M586</link>
      <description>That looks interesting and I'll try to test it ( &lt;STRONG&gt;Visual Studio 2005&lt;/STRONG&gt; on a &lt;STRONG&gt;32-bit&lt;/STRONG&gt; Windows platform ).&lt;BR /&gt;&lt;BR /&gt;A couple of days ago I tested the '&lt;STRONG&gt;_MM_TRANSPOSE4_PS&lt;/STRONG&gt;' macro vs. '&lt;STRONG&gt;No-For-Loops&lt;/STRONG&gt;' code ( just exchanges ) for a &lt;STRONG&gt;4x4&lt;/STRONG&gt; matrix of floats, and the No-For-Loops code outperformed the macro by a factor of a couple. I'll post results for comparison later.&lt;BR /&gt;</description>
      <pubDate>Wed, 14 Mar 2012 13:04:03 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Matrix-Transpose-for-char-short-array/m-p/798631#M586</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2012-03-14T13:04:03Z</dc:date>
    </item>
  </channel>
</rss>

