<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Help: Vectorization, x87 &amp; SSE2 in Intel® ISA Extensions</title>
    <link>https://community.intel.com/t5/Intel-ISA-Extensions/Help-Vectorization-x87-SSE2/m-p/893180#M2650</link>
    <description>&lt;DIV style="margin:0px;"&gt;
&lt;DIV id="quote_reply" style="width: 100%; margin-top: 5px;"&gt;
&lt;DIV style="margin-left:2px;margin-right:2px;"&gt;Quoting - &lt;A href="https://community.intel.com/en-us/profile/335665"&gt;Michael Stoner (Intel)&lt;/A&gt;&lt;/DIV&gt;
&lt;DIV style="background-color:#E5E5E5; padding:5px;border: 1px; border-style: inset;margin-left:2px;margin-right:2px;"&gt;&lt;EM&gt;
&lt;DIV style="margin:0px;"&gt;Hello,&lt;/DIV&gt;
&lt;DIV style="margin:0px;"&gt;&lt;/DIV&gt;
&lt;DIV style="margin:0px;"&gt;a) The vSplat macro is an abstraction for the SSE SHUFPS instruction.  Specifically it will perform a broadcast of element 'i' to every field in the XMM register, where 'i' is a value from 0-3.  I think the way it is coded will leave the original value 'v' untouched and return the broadcast result 'a' in a different register.  I am not sure what "a;" will do outside the macro.  I think it must be a way of returning that value from the macro.&lt;/DIV&gt;
&lt;DIV style="margin:0px;"&gt;&lt;/DIV&gt;
&lt;DIV style="margin:0px;"&gt;b) _m_empty() is an intrinsic for the EMMS instruction which clears the MMX state to avoid aliasing between the MMX registers and the x87 FP stack.  It isn't needed for SSE routines.  We recommend that any existing MMX code should be ported to SSE2 for best efficiency.&lt;/DIV&gt;
&lt;DIV style="margin:0px;"&gt;&lt;/DIV&gt;
&lt;DIV style="margin:0px;"&gt;c)&lt;/DIV&gt;
&lt;DIV style="margin:0px;"&gt;i) The matrix multiply loops might vectorize, especially if you add "#pragma vector always" and work with the vectorization report.  I'm thinking the vectorizer might have a hard time realizing the block-unroll-jam transformation that makes your intrinsics code work efficiently.&lt;/DIV&gt;
&lt;DIV style="margin:0px;"&gt;&lt;/DIV&gt;
&lt;DIV style="margin:0px;"&gt;ii) I am not sure what you mean by checking for x87 code?  If you compile this as a 32-bit app with no instruction set target (QxK, QxW, QxP, etc.) then the compiler will generate x87 FP instructions for the loops.&lt;/DIV&gt;
&lt;DIV style="margin:0px;"&gt;&lt;/DIV&gt;
&lt;DIV style="margin:0px;"&gt;iii)  It seems like you want to compare results computed with x87 and SSE instructions.  Compiling with no instruction set target should give you this comparison.  If you compile as 64-bit or use /QxK, /QxW, etc. then the loops will probably be coded with SSE scalar or packed instructions.&lt;/DIV&gt;
&lt;DIV style="margin:0px;"&gt;&lt;/DIV&gt;
&lt;DIV style="margin:0px;"&gt;Note, the Intel Math Kernel Library contains fast matrix multiply routines that have been tuned to the metal for the latest Intel CPU's.&lt;/DIV&gt;
&lt;DIV style="margin:0px;"&gt;Regards,&lt;/DIV&gt;
&lt;DIV style="margin:0px;"&gt;Mike Stoner&lt;/DIV&gt;
&lt;DIV style="margin:0px;"&gt;Software Engineer&lt;/DIV&gt;
&lt;DIV style="margin:0px;"&gt;Intel SSG - Application Performance&lt;/DIV&gt;
&lt;/EM&gt;&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;P&gt;Thanks, Michael, for your valuable input; I will move forward with it now.&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;
&lt;P&gt;Could you try answering this thread &lt;A href="http://software.intel.com/en-us/forums/showthread.php?t=62183" target="_blank"&gt;http://software.intel.com/en-us/forums/showthread.php?t=62183&lt;/A&gt; if possible.&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;
&lt;P&gt;~BR&lt;/P&gt;</description>
    <pubDate>Wed, 03 Dec 2008 04:32:35 GMT</pubDate>
    <dc:creator>srimks</dc:creator>
    <dc:date>2008-12-03T04:32:35Z</dc:date>
    <item>
      <title>Help: Vectorization, x87 &amp; SSE2</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Help-Vectorization-x87-SSE2/m-p/893177#M2647</link>
      <description>&lt;P&gt;Hi All.&lt;/P&gt;
&lt;P&gt;I am new to SIMD programming and have a piece of code as below -&lt;/P&gt;
&lt;P&gt;--&lt;/P&gt;
&lt;P&gt;#include &amp;lt;stdlib.h&amp;gt;&lt;BR /&gt;#include &amp;lt;stdio.h&amp;gt;&lt;BR /&gt;#include &amp;lt;math.h&amp;gt;&lt;BR /&gt;&lt;BR /&gt;#include &amp;lt;emmintrin.h&amp;gt;&lt;BR /&gt;&lt;BR /&gt;// Set up a vector type for a float[4] array for each vector type&lt;BR /&gt;typedef __m128 vFloat;&lt;BR /&gt;// Also define some macros to map a virtual SIMD language to&lt;BR /&gt;// each actual SIMD language.&lt;BR /&gt;&lt;BR /&gt;// Note that because i MUST be an immediate, it is incorrect here&lt;BR /&gt;// to alias i to a stackbased copy and replicate that 4 times.&lt;BR /&gt;#define vSplat( v, i ) ({ __m128 a = v; a = _mm_shuffle_ps( a, a, _MM_SHUFFLE(i,i,i,i) ); a; })&lt;BR /&gt;&lt;BR /&gt;inline __m128 vMADD( __m128 a, __m128 b, __m128 c )&lt;BR /&gt;{&lt;BR /&gt; return _mm_add_ps( c, _mm_mul_ps( a, b ) );&lt;BR /&gt;}&lt;BR /&gt;&lt;BR /&gt;#define vLoad( ptr ) _mm_load_ps( (float*) (ptr) )&lt;BR /&gt;#define vStore( v, ptr ) _mm_store_ps( (float*) (ptr), v )&lt;BR /&gt;#define vZero() _mm_setzero_ps()&lt;BR /&gt;&lt;BR /&gt;#define empty() _m_empty(); //clears the SSE2 registers and SSE2 state.&lt;BR /&gt;&lt;BR /&gt;// Prototype for a vector matrix multiply function&lt;BR /&gt;void MyMatrixMultiply( vFloat A[4], vFloat B[4], vFloat C[4] );&lt;BR /&gt;&lt;BR /&gt;int main( void )&lt;BR /&gt;{&lt;BR /&gt; // The vFloat type (defined previously) is a vector or scalar array that contains 4 floats&lt;BR /&gt; // Thus each one of these is a 10x10 matrix, stored in the C storage order.&lt;BR /&gt; vFloat A[10];&lt;BR /&gt; vFloat B[10];&lt;BR /&gt; vFloat C1[10];&lt;BR /&gt; vFloat C2[10];&lt;BR /&gt; int i, j, k;&lt;BR /&gt;&lt;BR /&gt;// Pointers to the elements in A, B, C1 and C2&lt;BR /&gt;float *a = (float*) &amp;amp;A;&lt;BR /&gt;float *b = (float*) &amp;amp;B;&lt;BR /&gt;float *c1 = (float*) &amp;amp;C1;&lt;BR /&gt;float *c2 = (float*) &amp;amp;C2;&lt;BR /&gt;&lt;BR /&gt;// Initialize the data&lt;BR /&gt;for( i = 0; i &amp;lt; 100; i++ )&lt;BR /&gt; {&lt;BR /&gt;
 a[i] = (double) (rand() - RAND_MAX/2) / (double) (RAND_MAX );&lt;BR /&gt; b[i] = (double) (rand() - RAND_MAX/2) / (double) (RAND_MAX );&lt;BR /&gt; c1[i] = c2[i] = 0.0;&lt;BR /&gt; }&lt;/P&gt;
&lt;P&gt;// Perform the brute-force version of matrix multiplication and use this later to check for correctness&lt;BR /&gt;printf( "Doing simple matrix multiply...\n" );&lt;BR /&gt;for( i = 0; i &amp;lt; 10; i++ )&lt;BR /&gt; for( j = 0; j &amp;lt; 10; j++ )&lt;BR /&gt; {&lt;BR /&gt; float result = 0.0f;&lt;BR /&gt; for( k = 0; k &amp;lt; 10; k++ )&lt;BR /&gt; result += a[ i * 10 + k] * b[ k * 10 + j ];&lt;BR /&gt; c1[ i * 10 + j ] = result;&lt;BR /&gt; }&lt;BR /&gt;&lt;BR /&gt; // The vector version&lt;BR /&gt; printf( "Doing vector matrix multiply...\n" );&lt;BR /&gt; MyMatrixMultiply( A, B, C2 );&lt;BR /&gt;&lt;BR /&gt; // Make sure that the results are correct&lt;BR /&gt; // Allow for some rounding error here&lt;BR /&gt; printf( "Verifying results..." );&lt;BR /&gt; for( i = 0 ; i &amp;lt; 100; i++ )&lt;BR /&gt; if( fabs( c1[i] - c2[i] ) &amp;gt; 1e-6 )&lt;BR /&gt; printf( "failed at %i,%i: %8.17g %8.17g\n", i/4, i&amp;amp;3, c1[i], c2[i] );&lt;BR /&gt;&lt;BR /&gt; printf( "done.\n" );&lt;BR /&gt;&lt;BR /&gt; return 0;&lt;BR /&gt;}&lt;BR /&gt;&lt;BR /&gt;void MyMatrixMultiply( vFloat A[16], vFloat B[16], vFloat C[16] )&lt;BR /&gt;{&lt;/P&gt;
&lt;P&gt;vFloat A1 = vLoad( A ); //Row 1 of A&lt;BR /&gt; vFloat A2 = vLoad( A + 1 ); //Row 2 of A&lt;BR /&gt; vFloat A3 = vLoad( A + 2 ); //Row 3 of A&lt;BR /&gt; vFloat A4 = vLoad( A + 3); //Row 4 of A&lt;BR /&gt; vFloat C1 = vZero(); //Row 1 of C, initialized to zero&lt;BR /&gt; vFloat C2 = vZero(); //Row 2 of C, initialized to zero&lt;BR /&gt; vFloat C3 = vZero(); //Row 3 of C, initialized to zero&lt;BR /&gt; vFloat C4 = vZero(); //Row 4 of C, initialized to zero&lt;BR /&gt; vFloat B1 = vLoad( B ); //Row 1 of B&lt;BR /&gt; vFloat B2 = vLoad( B + 1 ); //Row 2 of B&lt;BR /&gt; vFloat B3 = vLoad( B + 2 ); //Row 3 of B&lt;BR /&gt; vFloat B4 = vLoad( B + 3); //Row 4 of B&lt;BR /&gt; //Multiply the first row of B by the first column of A (do not sum across)&lt;BR /&gt; C1 = vMADD( vSplat( A1, 0 ), B1, C1 );&lt;BR /&gt; C2 = vMADD( vSplat( A2, 0 ), B1, C2 );&lt;BR /&gt; C3 = vMADD( vSplat( A3, 0 ), B1, C3 );&lt;BR /&gt; C4 = vMADD( vSplat( A4, 0 ), B1, C4 );&lt;BR /&gt; // Multiply the second row of B by the second column of A and&lt;BR /&gt; // add to the previous result (do not sum across)&lt;BR /&gt; C1 = vMADD( vSplat( A1, 1 ), B2, C1 );&lt;BR /&gt; C2 = vMADD( vSplat( A2, 1 ), B2, C2 );&lt;BR /&gt; C3 = vMADD( vSplat( A3, 1 ), B2, C3 );&lt;BR /&gt; C4 = vMADD( vSplat( A4, 1 ), B2, C4 );&lt;BR /&gt; // Multiply the third row of B by the third column of A and&lt;BR /&gt; // add to the previous result (do not sum across)&lt;BR /&gt; C1 = vMADD( vSplat( A1, 2 ), B3, C1 );&lt;BR /&gt; C2 = vMADD( vSplat( A2, 2 ), B3, C2 );&lt;BR /&gt; C3 = vMADD( vSplat( A3, 2 ), B3, C3 );&lt;BR /&gt; C4 = vMADD( vSplat( A4, 2 ), B3, C4 );&lt;BR /&gt; // Multiply the fourth row of B by the fourth column of A and&lt;BR /&gt; // add to the previous result (do not sum across)&lt;BR /&gt; C1 = vMADD( vSplat( A1, 3 ), B4, C1 );&lt;BR /&gt; C2 = vMADD( vSplat( A2, 3 ), B4, C2 );&lt;BR /&gt; C3 = vMADD( vSplat( A3, 3 ), B4, C3 );&lt;BR /&gt; C4 = vMADD( vSplat( A4, 3 ), B4, C4 );&lt;BR /&gt; 
// Write out the result to the destination&lt;BR /&gt; vStore( C1, C );&lt;BR /&gt; vStore( C2, C + 1 );&lt;BR /&gt; vStore( C3, C + 2 );&lt;BR /&gt; vStore( C4, C + 3 );&lt;/P&gt;
&lt;P&gt;empty(); //clears the SSE2 registers and SSE2 state.&lt;/P&gt;
&lt;P&gt;}&lt;/P&gt;
&lt;P&gt;------&lt;/P&gt;
&lt;P&gt;I have the following questions, though they may be simple for others:&lt;/P&gt;
&lt;P&gt;(a) What does the complete macro on line 14 mean?&lt;BR /&gt;-----&lt;BR /&gt;#define vSplat( v, i ) ({ __m128 a = v; a = _mm_shuffle_ps( a, a, _MM_SHUFFLE(i,i,i,i) ); a; })&lt;BR /&gt;-----&lt;BR /&gt;Could each statement in the macro body be explained in sequence?&lt;BR /&gt;__m128 a = v;&lt;BR /&gt;a = _mm_shuffle_ps( a, a, _MM_SHUFFLE(i,i,i,i) );&lt;BR /&gt;a;&lt;/P&gt;
&lt;P&gt;(b) On line 25, I defined empty() as _m_empty() for SSE2, and it is called at the end of MyMatrixMultiply(). Does calling it there improve performance, or is it simply not needed for SSE2?&lt;BR /&gt;&lt;BR /&gt;(c) Within this code:&lt;BR /&gt; (i) Can the for loops on lines 56-63 be vectorized with compiler flags (Intel/GNU)?&lt;BR /&gt; (ii) This code was written for SSE2. Can it also be checked with x87 FP registers and operations and, if so, how?&lt;BR /&gt; (iii) Can the performance of x87 and SSE2 FP code be compared and, if so, how can that be done and analyzed?&lt;BR /&gt;&lt;BR /&gt;I would appreciate your responses to the above.&lt;/P&gt;
&lt;P&gt;I have been referring to the Intel C++ Intrinsics Reference and the Intel C++ Compiler documentation.&lt;/P&gt;
&lt;P&gt;~BR&lt;/P&gt;</description>
      <pubDate>Mon, 01 Dec 2008 12:06:50 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Help-Vectorization-x87-SSE2/m-p/893177#M2647</guid>
      <dc:creator>srimks</dc:creator>
      <dc:date>2008-12-01T12:06:50Z</dc:date>
    </item>
    <item>
      <title>Re: Help: Vectorization, x87 &amp; SSE2</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Help-Vectorization-x87-SSE2/m-p/893178#M2648</link>
      <description>&lt;DIV style="margin:0px;"&gt;Hello,&lt;/DIV&gt;
&lt;DIV style="margin:0px;"&gt;&lt;/DIV&gt;
&lt;DIV style="margin:0px;"&gt;a) The vSplat macro is an abstraction for the SSE SHUFPS instruction. Specifically it will perform a broadcast of element 'i' to every field in the XMM register, where 'i' is a value from 0-3. I think the way it is coded will leave the original value 'v' untouched and return the broadcast result 'a' in a different register. I am not sure what "a;" will do outside the macro. I think it must be a way of returning that value from the macro.&lt;/DIV&gt;
&lt;DIV style="margin:0px;"&gt;&lt;/DIV&gt;
&lt;DIV style="margin:0px;"&gt;b) _m_empty() is an intrinsic for the EMMS instruction which clears the MMX state to avoid aliasing between the MMX registers and the x87 FP stack. It isn't needed for SSE routines. We recommend that any existing MMX code should be ported to SSE2 for best efficiency.&lt;/DIV&gt;
&lt;DIV style="margin:0px;"&gt;&lt;/DIV&gt;
&lt;DIV style="margin:0px;"&gt;c)&lt;/DIV&gt;
&lt;DIV style="margin:0px;"&gt;i) The matrix multiply loops might vectorize, especially if you add "#pragma vector always" and work with the vectorization report. I'm thinking the vectorizer might have a hard time realizing the block-unroll-jam transformation that makes your intrinsics code work efficiently.&lt;/DIV&gt;
&lt;DIV style="margin:0px;"&gt;&lt;/DIV&gt;
&lt;DIV style="margin:0px;"&gt;ii) I am not sure what you mean by checking for x87 code? If you compile this as a 32-bit app with no instruction set target (QxK, QxW, QxP, etc.) then the compiler will generate x87 FP instructions for the loops.&lt;/DIV&gt;
&lt;DIV style="margin:0px;"&gt;&lt;/DIV&gt;
&lt;DIV style="margin:0px;"&gt;iii) It seems like you want to compare results computed with x87 and SSE instructions. Compiling with no instruction set target should give you this comparison. If you compile as 64-bit or use /QxK, /QxW, etc. then the loops will probably be coded with SSE scalar or packed instructions.&lt;/DIV&gt;
&lt;DIV style="margin:0px;"&gt;&lt;/DIV&gt;
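&lt;DIV style="margin:0px;"&gt;A minimal sketch of the broadcast described in (a), assuming GCC or another compiler that accepts GNU statement expressions. In a ({ ... }) block the value of the last statement, here "a;", becomes the value of the whole expression, which is how the macro hands back its result. The names vSplatPlain, splat_element2 and splat_element2_plain are hypothetical and only for illustration; they are not part of the original post.&lt;/DIV&gt;
<![CDATA[
```c
#include <xmmintrin.h>  /* SSE: __m128, _mm_shuffle_ps, _MM_SHUFFLE */

/* Macro from the question. ({ ... }) is a GNU C statement expression:
   the final "a;" is the value the whole block evaluates to. */
#define vSplat(v, i) \
    ({ __m128 a = (v); a = _mm_shuffle_ps(a, a, _MM_SHUFFLE(i, i, i, i)); a; })

/* Assumed alternative without the GNU extension (evaluates v twice). */
#define vSplatPlain(v, i) _mm_shuffle_ps((v), (v), _MM_SHUFFLE(i, i, i, i))

/* Hypothetical helper: broadcast element 2 of src[0..3] into dst[0..3]. */
void splat_element2(const float *src, float *dst)
{
    __m128 v = _mm_loadu_ps(src);   /* unaligned load; v itself is not modified */
    __m128 s = vSplat(v, 2);        /* s = { v[2], v[2], v[2], v[2] } */
    _mm_storeu_ps(dst, s);
}

/* Same broadcast using the plain macro. */
void splat_element2_plain(const float *src, float *dst)
{
    _mm_storeu_ps(dst, vSplatPlain(_mm_loadu_ps(src), 2));
}
```
]]>
&lt;DIV style="margin:0px;"&gt;With src = {10, 20, 30, 40}, both helpers write 30 into all four slots of dst and leave src unchanged. The index must be a compile-time constant in either form, because SHUFPS takes it as an immediate.&lt;/DIV&gt;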
&lt;DIV style="margin:0px;"&gt;Note, the Intel Math Kernel Library contains fast matrix multiply routines that have been tuned to the metal for the latest Intel CPU's.&lt;/DIV&gt;
&lt;DIV style="margin:0px;"&gt;Regards,&lt;/DIV&gt;
&lt;DIV style="margin:0px;"&gt;Mike Stoner&lt;/DIV&gt;
&lt;DIV style="margin:0px;"&gt;Software Engineer&lt;/DIV&gt;
&lt;DIV style="margin:0px;"&gt;Intel SSG - Application Performance&lt;/DIV&gt;</description>
      <pubDate>Wed, 03 Dec 2008 00:14:40 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Help-Vectorization-x87-SSE2/m-p/893178#M2648</guid>
      <dc:creator>Michael_S_Intel8</dc:creator>
      <dc:date>2008-12-03T00:14:40Z</dc:date>
    </item>
    <item>
      <title>Re: Help: Vectorization, x87 &amp; SSE2</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Help-Vectorization-x87-SSE2/m-p/893179#M2649</link>
      <description>&lt;DIV style="margin: 0px; height: auto;"&gt;&lt;/DIV&gt;
&lt;P&gt;If you want to compare x87 and SSE2 results, you should use Linux and plain Fortran or C source code. Intel compilers perform automatic unroll-and-jam optimization for matrix multiply with options -xP -O3. You should be able to drop back to full-precision x87 code with -mp -pc80 -long-double. Recent GNU compilers will perform SSE2 vectorization with options -O3 -mfpmath=sse -ffast-math, and generate x87 code with -O3 -mfpmath=387.&lt;/P&gt;
&lt;P&gt;x87 on Windows is generally available only in 32-bit compilers, and the default for Microsoft compatibility is to set x87 53-bit precision mode. I don't see any value in an x87 comparison, unless you use 64-bit precision accumulation.&lt;/P&gt;
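&lt;P&gt;A minimal sketch of the kind of plain C source such a comparison could start from, assuming GCC on Linux. The function names matmul4 and fp_eval_method are hypothetical, and the build lines in the comment are one plausible pairing of the GNU options, not a tuned setup.&lt;/P&gt;
<![CDATA[
```c
#include <float.h>  /* FLT_EVAL_METHOD (C99) */

/* Possible builds, assuming GCC (the 32-bit build needs 32-bit libraries):
 *   gcc -m32 -O3 -mfpmath=387 file.c          x87 code
 *   gcc -O3 -msse2 -mfpmath=sse file.c        SSE/SSE2 code
 */

/* Plain C 4x4 single-precision matrix multiply, row-major: c = a * b.
   No intrinsics, so the compiler chooses x87 or SSE code from the flags. */
void matmul4(const float *a, const float *b, float *c)
{
    for (int i = 0; i < 4; i++) {
        for (int j = 0; j < 4; j++) {
            float sum = 0.0f;
            for (int k = 0; k < 4; k++)
                sum += a[i * 4 + k] * b[k * 4 + j];
            c[i * 4 + j] = sum;
        }
    }
}

/* Reports how floating-point expressions are evaluated in this build:
   0 = in the operand type's own precision (typical for SSE math),
   2 = in long double precision (typical for x87 math),
   -1 = indeterminable. */
int fp_eval_method(void)
{
    return (int)FLT_EVAL_METHOD;
}
```
]]>
&lt;P&gt;Building the same file both ways and comparing the output arrays element by element can show where extended-precision accumulation on x87 and single-precision accumulation on SSE round differently.&lt;/P&gt;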
&lt;P&gt;You should be able to approach optimum performance without resorting to low-level coding such as you suggested. In fact, it may be quite difficult to arrive at adequate base performance and correctness if you start with intrinsics. Optimization with threading is much more difficult, but still should be possible with C or Fortran source with OpenMP. As Mike suggested, you should consider MKL as a reference for professionally optimized code, both threaded and not. It has the advantage of being a plug-in replacement for the netlib BLAS source code.&lt;/P&gt;</description>
      <pubDate>Wed, 03 Dec 2008 03:03:18 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Help-Vectorization-x87-SSE2/m-p/893179#M2649</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2008-12-03T03:03:18Z</dc:date>
    </item>
    <item>
      <title>Re: Help: Vectorization, x87 &amp; SSE2</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Help-Vectorization-x87-SSE2/m-p/893180#M2650</link>
      <description>&lt;DIV style="margin:0px;"&gt;
&lt;DIV id="quote_reply" style="width: 100%; margin-top: 5px;"&gt;
&lt;DIV style="margin-left:2px;margin-right:2px;"&gt;Quoting - &lt;A href="https://community.intel.com/en-us/profile/335665"&gt;Michael Stoner (Intel)&lt;/A&gt;&lt;/DIV&gt;
&lt;DIV style="background-color:#E5E5E5; padding:5px;border: 1px; border-style: inset;margin-left:2px;margin-right:2px;"&gt;&lt;EM&gt;
&lt;DIV style="margin:0px;"&gt;Hello,&lt;/DIV&gt;
&lt;DIV style="margin:0px;"&gt;&lt;/DIV&gt;
&lt;DIV style="margin:0px;"&gt;a) The vSplat macro is an abstraction for the SSE SHUFPS instruction.  Specifically it will perform a broadcast of element 'i' to every field in the XMM register, where 'i' is a value from 0-3.  I think the way it is coded will leave the original value 'v' untouched and return the broadcast result 'a' in a different register.  I am not sure what "a;" will do outside the macro.  I think it must be a way of returning that value from the macro.&lt;/DIV&gt;
&lt;DIV style="margin:0px;"&gt;&lt;/DIV&gt;
&lt;DIV style="margin:0px;"&gt;b) _m_empty() is an intrinsic for the EMMS instruction which clears the MMX state to avoid aliasing between the MMX registers and the x87 FP stack.  It isn't needed for SSE routines.  We recommend that any existing MMX code should be ported to SSE2 for best efficiency.&lt;/DIV&gt;
&lt;DIV style="margin:0px;"&gt;&lt;/DIV&gt;
&lt;DIV style="margin:0px;"&gt;c)&lt;/DIV&gt;
&lt;DIV style="margin:0px;"&gt;i) The matrix multiply loops might vectorize, especially if you add "#pragma vector always" and work with the vectorization report.  I'm thinking the vectorizer might have a hard time realizing the block-unroll-jam transformation that makes your intrinsics code work efficiently.&lt;/DIV&gt;
&lt;DIV style="margin:0px;"&gt;&lt;/DIV&gt;
&lt;DIV style="margin:0px;"&gt;ii) I am not sure what you mean by checking for x87 code?  If you compile this as a 32-bit app with no instruction set target (QxK, QxW, QxP, etc.) then the compiler will generate x87 FP instructions for the loops.&lt;/DIV&gt;
&lt;DIV style="margin:0px;"&gt;&lt;/DIV&gt;
&lt;DIV style="margin:0px;"&gt;iii)  It seems like you want to compare results computed with x87 and SSE instructions.  Compiling with no instruction set target should give you this comparison.  If you compile as 64-bit or use /QxK, /QxW, etc. then the loops will probably be coded with SSE scalar or packed instructions.&lt;/DIV&gt;
&lt;DIV style="margin:0px;"&gt;&lt;/DIV&gt;
&lt;DIV style="margin:0px;"&gt;Note, the Intel Math Kernel Library contains fast matrix multiply routines that have been tuned to the metal for the latest Intel CPU's.&lt;/DIV&gt;
&lt;DIV style="margin:0px;"&gt;Regards,&lt;/DIV&gt;
&lt;DIV style="margin:0px;"&gt;Mike Stoner&lt;/DIV&gt;
&lt;DIV style="margin:0px;"&gt;Software Engineer&lt;/DIV&gt;
&lt;DIV style="margin:0px;"&gt;Intel SSG - Application Performance&lt;/DIV&gt;
&lt;/EM&gt;&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;P&gt;Thanks, Michael, for your valuable input; I will move forward with it now.&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;
&lt;P&gt;Could you try answering this thread &lt;A href="http://software.intel.com/en-us/forums/showthread.php?t=62183" target="_blank"&gt;http://software.intel.com/en-us/forums/showthread.php?t=62183&lt;/A&gt; if possible.&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;
&lt;P&gt;~BR&lt;/P&gt;</description>
      <pubDate>Wed, 03 Dec 2008 04:32:35 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Help-Vectorization-x87-SSE2/m-p/893180#M2650</guid>
      <dc:creator>srimks</dc:creator>
      <dc:date>2008-12-03T04:32:35Z</dc:date>
    </item>
    <item>
      <title>Re: Help: Vectorization, x87 &amp; SSE2</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Help-Vectorization-x87-SSE2/m-p/893181#M2651</link>
      <description>&lt;DIV style="margin:0px;"&gt;
&lt;DIV id="quote_reply" style="width: 100%; margin-top: 5px;"&gt;
&lt;DIV style="margin-left:2px;margin-right:2px;"&gt;Quoting - &lt;A href="https://community.intel.com/en-us/profile/367365"&gt;tim18&lt;/A&gt;&lt;/DIV&gt;
&lt;DIV style="background-color:#E5E5E5; padding:5px;border: 1px; border-style: inset;margin-left:2px;margin-right:2px;"&gt;&lt;EM&gt;
&lt;P&gt;If you want to compare x87 and SSE2 results, you should use Linux and plain Fortran or C source code.  Intel compilers perform automatic unroll-and-jam optimization for matrix multiply with options -xP -O3.  You should be able to drop back to full-precision x87 code with -mp -pc80 -long-double.  Recent GNU compilers will perform SSE2 vectorization with options -O3 -mfpmath=sse, and generate x87 code with -O3 -mfpmath=387.&lt;/P&gt;
&lt;P&gt;x87 on Windows is generally available only in 32-bit compilers, and the default for Microsoft compatibility is to set x87 53-bit precision mode.  I don't see any value in an x87 comparison, unless you use 64-bit precision accumulation.&lt;/P&gt;
&lt;P&gt;You should be able to approach optimum performance without resorting to low-level coding such as you suggested. In fact, it may be quite difficult to arrive at adequate base performance and correctness if you start with intrinsics.  Optimization with threading is much more difficult, but still should be possible with C or Fortran source with OpenMP.  As Mike suggested, you should consider MKL as a reference for professionally optimized code, both threaded and not.  It has the advantage of being a plug-in replacement for the netlib BLAS source code.&lt;/P&gt;
&lt;/EM&gt;&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;P&gt;Thanks, Tim, for your valuable input; I will move forward with it now.&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;
&lt;P&gt;~BR&lt;/P&gt;</description>
      <pubDate>Wed, 03 Dec 2008 04:33:45 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Help-Vectorization-x87-SSE2/m-p/893181#M2651</guid>
      <dc:creator>srimks</dc:creator>
      <dc:date>2008-12-03T04:33:45Z</dc:date>
    </item>
    <item>
      <title>Re: Help: Vectorization, x87 &amp; SSE2</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Help-Vectorization-x87-SSE2/m-p/893182#M2652</link>
      <description>&lt;DIV style="margin:0px;"&gt;
&lt;DIV id="quote_reply" style="width: 100%; margin-top: 5px;"&gt;Hi Michael.&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;DIV style="width: 100%; margin-top: 5px;"&gt;&lt;BR /&gt;&lt;/DIV&gt;
&lt;DIV style="width: 100%; margin-top: 5px;"&gt;I modified the above code for SSE2 double-precision (DP) FP analysis as below -&lt;/DIV&gt;
&lt;DIV style="width: 100%; margin-top: 5px;"&gt;---&lt;/DIV&gt;
&lt;P&gt;#include &amp;lt;stdlib.h&amp;gt;&lt;BR /&gt;#include &amp;lt;stdio.h&amp;gt;&lt;BR /&gt;#include &amp;lt;math.h&amp;gt;&lt;BR /&gt;&lt;BR /&gt;#include &amp;lt;emmintrin.h&amp;gt;&lt;BR /&gt;&lt;BR /&gt;// Set up a vector type for a float[4] array for each vector type&lt;BR /&gt;typedef __m128d vFloat;&lt;BR /&gt;&lt;BR /&gt;// Note that i is a mask which is an immediate&lt;BR /&gt;// Selects two specific DP FP values from a and b, based on the mask i. &lt;BR /&gt;// The mask must be an immediate. (SHUFPD)&lt;BR /&gt;#define vSplat( v, i ) ({ __m128d a = v; a = _mm_shuffle_pd( a, a, &lt;BR /&gt; _MM_SHUFFLE(i,i,i,i) ); a; }) &lt;BR /&gt;&lt;BR /&gt;inline __m128d vMADD(__m128d a, __m128d b, __m128d c)&lt;BR /&gt;{&lt;BR /&gt; return _mm_add_pd( c, _mm_mul_pd( a, b ) );&lt;BR /&gt;}&lt;BR /&gt;&lt;BR /&gt;#define vLoad( ptr ) _mm_load_pd( (double const *) (ptr)) // Loads two DP FP values (MOVAPD)&lt;BR /&gt;#define vStore( v, ptr ) _mm_store_pd( (double*) (ptr), v ) // Store two DP FP values (MOVAPD)&lt;BR /&gt;#define vZero() _mm_setzero_pd() // Sets two DP FP values to zero (XORPD)&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;// Prototype for a vector matrix multiply function&lt;BR /&gt;void MyMatrixMultiply( vFloat A[4], vFloat B[4], vFloat C[4] );&lt;BR /&gt;&lt;BR /&gt;int main( void )&lt;BR /&gt;{&lt;BR /&gt; // The vFloat type (defined previously) is a vector array that contains 2 double&lt;BR /&gt; // Thus each one of these is a 4x4 matrix, stored in the C storage order.&lt;BR /&gt; vFloat A[4];&lt;BR /&gt; vFloat B[4];&lt;BR /&gt; vFloat C1[4];&lt;BR /&gt; vFloat C2[4];&lt;BR /&gt; int i, j, k;&lt;BR /&gt;&lt;BR /&gt;// Pointers to the elements in A, B, C1 and C2&lt;BR /&gt;double *a = (double*) &amp;amp;A;&lt;BR /&gt;double *b = (double*) &amp;amp;B;&lt;BR /&gt;double *c1 = (double*) &amp;amp;C1;&lt;BR /&gt;double *c2 = (double*) &amp;amp;C2;&lt;BR /&gt;&lt;BR /&gt;// Initialize the data&lt;BR /&gt;for( i = 0; i &amp;lt; 4; i++ )&lt;BR /&gt; {&lt;BR /&gt;
 a[i] = (double) (rand() - RAND_MAX/2) / (double) (RAND_MAX );&lt;BR /&gt; b[i] = (double) (rand() - RAND_MAX/2) / (double) (RAND_MAX );&lt;BR /&gt; c1[i] = c2[i] = 0.0;&lt;BR /&gt; }&lt;BR /&gt;// Perform matrix multiplication and use this later to check for correctness&lt;BR /&gt;printf( "Doing simple matrix multiply...\n" );&lt;BR /&gt;for( i = 0; i &amp;lt; 4; i++ )&lt;BR /&gt; for( j = 0; j &amp;lt; 4; j++ )&lt;BR /&gt; {&lt;BR /&gt; double result = 0.0f;&lt;BR /&gt; for( k = 0; k &amp;lt; 4; k++ )&lt;BR /&gt; result += a[ i * 4 + k] * b[ k * 4 + j ];&lt;BR /&gt; c1[ i * 4 + j ] = result;&lt;BR /&gt; }&lt;/P&gt;
&lt;P&gt;// The vector version&lt;BR /&gt; printf( "Doing vector matrix multiply...\n" );&lt;BR /&gt; MyMatrixMultiply( A, B, C2 );&lt;BR /&gt;&lt;BR /&gt; // Make sure that the results are correct allow for some rounding error here&lt;BR /&gt; printf( "Verifying results..." );&lt;BR /&gt; for( i = 0 ; i &amp;lt; 16; i++ )&lt;BR /&gt; if( fabs( c1[i] - c2[i] ) &amp;gt; 1e-20 )&lt;BR /&gt; printf( "failed at %i,%i: %8.34g %8.34g\n", i/4, i&amp;amp;3, c1[i], c2[i] );&lt;BR /&gt;&lt;BR /&gt; printf( "done.\n" );&lt;BR /&gt;&lt;BR /&gt; return 0;&lt;BR /&gt;&lt;BR /&gt;}&lt;BR /&gt;&lt;BR /&gt;void MyMatrixMultiply( vFloat A[4], vFloat B[4], vFloat C[4] )&lt;BR /&gt;{&lt;BR /&gt; vFloat A1 = vLoad( A ); //Row 1 of A&lt;BR /&gt; vFloat A2 = vLoad( A + 1 ); //Row 2 of A&lt;BR /&gt; vFloat A3 = vLoad( A + 2 ); //Row 3 of A&lt;BR /&gt; vFloat A4 = vLoad( A + 3); //Row 4 of A&lt;BR /&gt;&lt;BR /&gt; vFloat C1 = vZero(); //Row 1 of C, initialized to zero&lt;BR /&gt; vFloat C2 = vZero(); //Row 2 of C, initialized to zero&lt;BR /&gt; vFloat C3 = vZero(); //Row 3 of C, initialized to zero&lt;BR /&gt; vFloat C4 = vZero(); //Row 4 of C, initialized to zero&lt;BR /&gt;&lt;BR /&gt; vFloat B1 = vLoad( B ); //Row 1 of B&lt;BR /&gt; vFloat B2 = vLoad( B + 1 ); //Row 2 of B&lt;BR /&gt; vFloat B3 = vLoad( B + 2 ); //Row 3 of B&lt;BR /&gt; vFloat B4 = vLoad( B + 3 ); //Row 4 of B&lt;BR /&gt;&lt;BR /&gt; //Multiply the first row of B by the first column of A (do not sum across)&lt;BR /&gt; C1 = vMADD( vSplat( A1, 0 ), B1, C1 );&lt;BR /&gt; C2 = vMADD( vSplat( A2, 0 ), B1, C2 );&lt;BR /&gt; C3 = vMADD( vSplat( A3, 0 ), B1, C3 );&lt;BR /&gt; C4 = vMADD( vSplat( A4, 0 ), B1, C4 );&lt;BR /&gt;&lt;BR /&gt; // Multiply the second row of B by the second column of A and&lt;BR /&gt; // add to the previous result (do not sum across)&lt;BR /&gt; C1 = vMADD( vSplat( A1, 1 ), B2, C1 );&lt;BR /&gt; C2 = vMADD( vSplat( A2, 1 ), B2, C2 );&lt;BR /&gt;
 C3 = vMADD( vSplat( A3, 1 ), B2, C3 );&lt;BR /&gt; C4 = vMADD( vSplat( A4, 1 ), B2, C4 );&lt;BR /&gt;&lt;BR /&gt; // Multiply the third row of B by the third column of A and&lt;BR /&gt; // add to the previous result (do not sum across)&lt;BR /&gt; C1 = vMADD( vSplat( A1, 2 ), B3, C1 );&lt;BR /&gt; C2 = vMADD( vSplat( A2, 2 ), B3, C2 );&lt;BR /&gt; C3 = vMADD( vSplat( A3, 2 ), B3, C3 );&lt;BR /&gt; C4 = vMADD( vSplat( A4, 2 ), B3, C4 );&lt;BR /&gt;&lt;BR /&gt; // Multiply the fourth row of B by the fourth column of A and&lt;BR /&gt; // add to the previous result (do not sum across)&lt;BR /&gt; C1 = vMADD( vSplat( A1, 3 ), B4, C1 );&lt;BR /&gt; C2 = vMADD( vSplat( A2, 3 ), B4, C2 );&lt;BR /&gt; C3 = vMADD( vSplat( A3, 3 ), B4, C3 );&lt;BR /&gt; C4 = vMADD( vSplat( A4, 3 ), B4, C4 );&lt;BR /&gt;&lt;BR /&gt; // Write out the result to the destination&lt;/P&gt;
&lt;P&gt;vStore( C1, C );&lt;/P&gt;
&lt;P&gt;vStore( C2, C + 1 );&lt;BR /&gt; vStore( C3, C + 2 );&lt;BR /&gt; vStore( C4, C + 3 );&lt;BR /&gt;}&lt;BR /&gt;---&lt;/P&gt;
&lt;P&gt;It compiled fine with GCC (v4.4), but at run time I got the errors below -&lt;/P&gt;
&lt;P&gt;--&lt;/P&gt;
&lt;P&gt;time ./matrix-4X4-sse2&lt;BR /&gt;Doing simple matrix multiply...&lt;BR /&gt;Doing vector matrix multiply...&lt;BR /&gt;Verifying results...failed at 0,0: 0.1041077441809730858013338661294256 1.615112779972704616472675241484077e-310&lt;BR /&gt;failed at 0,1: 0.2180626815004354512872453142335871 2.329319807572869703155260521236589e-311&lt;BR /&gt;failed at 0,2: 0.06656423826015861466842693516809959 5.021118224839903602770002086837975e-311&lt;BR /&gt;failed at 0,3: 0.02341829647346145570896425169848953 -1.35546831104864329338482674488192e-311&lt;BR /&gt;failed at 2,0: 1.443534677236949856111745706357672e-314 0.1041077441809730858013338661294256&lt;BR /&gt;failed at 2,1: 1.202089919575080043699319648301355e-314 0.2180626815004354512872453142335871&lt;BR /&gt;failed at 2,2: 1.74640041019361879362255661637412e-314 0.06656423826015861466842693516809959&lt;BR /&gt;failed at 2,3: -6.987567207824738989361744406306725e-315 0.02341829647346145570896425169848953&lt;/P&gt;
&lt;P&gt;done.&lt;/P&gt;
&lt;P&gt;--&lt;/P&gt;
&lt;P&gt;I tried debugging and found the following -&lt;/P&gt;
&lt;P&gt;While stepping through MyMatrixMultiply() at L# 120, the statement "return (__m128)*(__v4DF)__P" was not executed; the debugger jumped directly to "return __extension__ (__m128d){ 0.0, 0.0 }" in the emmintrin.h SSE2 header file.&lt;/P&gt;
&lt;P&gt;The SSE SP FP code posted earlier did show the above debugging flow within the xmmintrin.h header file.&lt;/P&gt;
&lt;P&gt;Did I miss anything while converting the earlier SSE SP FP code to today's SSE2 DP FP code?&lt;/P&gt;
&lt;P&gt;Any clue?&lt;/P&gt;
&lt;P&gt;~BR&lt;/P&gt;</description>
      <pubDate>Tue, 09 Dec 2008 12:35:59 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Help-Vectorization-x87-SSE2/m-p/893182#M2652</guid>
      <dc:creator>srimks</dc:creator>
      <dc:date>2008-12-09T12:35:59Z</dc:date>
    </item>
    <item>
      <title>Re: Help: Vectorization, x87 &amp; SSE2</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Help-Vectorization-x87-SSE2/m-p/893183#M2653</link>
      <description>&lt;DIV style="margin:0px;"&gt;
&lt;DIV id="quote_reply" style="margin-top: 5px; width: 100%;"&gt;
&lt;DIV style="margin-left:2px;margin-right:2px;"&gt;Quoting - &lt;A href="https://community.intel.com/en-us/profile/407152"&gt;srimks&lt;/A&gt;&lt;/DIV&gt;
&lt;DIV style="background-color:#E5E5E5; padding:5px;border: 1px; border-style: inset;margin-left:2px;margin-right:2px;"&gt;&lt;EM&gt;
&lt;DIV style="margin:0px;"&gt;
&lt;DIV id="quote_reply" style="margin-top: 5px; width: 100%;"&gt;Hi Michael.&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;DIV style="margin-top: 5px; width: 100%;"&gt;&lt;BR /&gt;&lt;/DIV&gt;
&lt;DIV style="margin-top: 5px; width: 100%;"&gt;I did modified above code for SSE2 DP FP analysis as below -&lt;/DIV&gt;
&lt;DIV style="margin-top: 5px; width: 100%;"&gt;---&lt;/DIV&gt;
&lt;P&gt;#include &amp;lt;stdlib.h&amp;gt;&lt;BR /&gt;
#include &amp;lt;stdio.h&amp;gt;&lt;BR /&gt;
#include &amp;lt;math.h&amp;gt;&lt;BR /&gt;
&lt;BR /&gt;
#include &amp;lt;emmintrin.h&amp;gt;&lt;BR /&gt;
&lt;BR /&gt;
// Set up a vector type for a float[4] array for each vector type&lt;BR /&gt;
typedef __m128d vFloat;&lt;BR /&gt;
&lt;BR /&gt;
// Note that i is a mask which is an immediate&lt;BR /&gt;
// Selects two specific DP FP values from a and b, based on the mask i.&lt;BR /&gt;
// The mask must be an immediate. (SHUFPD)&lt;BR /&gt;
#define vSplat( v, i ) ({ __m128d a = v; a = _mm_shuffle_pd( a, a, _MM_SHUFFLE(i,i,i,i) ); a; })&lt;BR /&gt;
&lt;BR /&gt;
inline __m128d vMADD(__m128d a, __m128d b, __m128d c)&lt;BR /&gt;
{&lt;BR /&gt;
return _mm_add_pd( c, _mm_mul_pd( a, b ) );&lt;BR /&gt;
}&lt;BR /&gt;
&lt;BR /&gt;
#define vLoad( ptr ) _mm_load_pd( (double const *) (ptr)) // Loads two DP FP values (MOVAPD)&lt;BR /&gt;
#define vStore( v, ptr ) _mm_store_pd( (double*) (ptr), v ) // Store two DP FP values (MOVAPD)&lt;BR /&gt;
#define vZero() _mm_setzero_pd() // Sets two DP FP values to zero (XORPD)&lt;BR /&gt;
&lt;BR /&gt;
&lt;BR /&gt;
// Prototype for a vector matrix multiply function&lt;BR /&gt;
void MyMatrixMultiply( vFloat A[4], vFloat B[4], vFloat C[4] );&lt;BR /&gt;
&lt;BR /&gt;
int main( void )&lt;BR /&gt;
{&lt;BR /&gt;
// The vFloat type (defined previously) is a vector array that contains 2 double&lt;BR /&gt;
// Thus each one of these is a 4x4 matrix, stored in the C storage order.&lt;BR /&gt;
vFloat A[4];&lt;BR /&gt;
vFloat B[4];&lt;BR /&gt;
vFloat C1[4];&lt;BR /&gt;
vFloat C2[4];&lt;BR /&gt;
int i, j, k;&lt;BR /&gt;
&lt;BR /&gt;
// Pointers to the elements in A, B, C1 and C2&lt;BR /&gt;
double *a = (double*) &amp;amp;A;&lt;BR /&gt;
double *b = (double*) &amp;amp;B;&lt;BR /&gt;
double *c1 = (double*) &amp;amp;C1;&lt;BR /&gt;
double *c2 = (double*) &amp;amp;C2;&lt;BR /&gt;
&lt;BR /&gt;
// Initialize the data&lt;BR /&gt;
for( i = 0; i &amp;lt; 4; i++ )&lt;BR /&gt;
{&lt;BR /&gt;
a[i] = (double) (rand() - RAND_MAX/2) / (double) (RAND_MAX );&lt;BR /&gt;
b[i] = (double) (rand() - RAND_MAX/2) / (double) (RAND_MAX );&lt;BR /&gt;
c1[i] = c2[i] = 0.0;&lt;BR /&gt;
}&lt;BR /&gt;
// Perform matrix multiplication and use this later to check for correctness&lt;BR /&gt;
printf( "Doing simple matrix multiply...\n" );&lt;BR /&gt;
for( i = 0; i &amp;lt; 4; i++ )&lt;BR /&gt;
for( j = 0; j &amp;lt; 4; j++ )&lt;BR /&gt;
{&lt;BR /&gt;
double result = 0.0f;&lt;BR /&gt;
for( k = 0; k &amp;lt; 4; k++ )&lt;BR /&gt;
result += a[ i * 4 + k] * b[ k * 4 + j ];&lt;BR /&gt;
c1[ i * 4 + j ] = result;&lt;BR /&gt;
}&lt;/P&gt;
&lt;P&gt;// The vector version&lt;BR /&gt;
printf( "Doing vector matrix multiply...\n" );&lt;BR /&gt;
MyMatrixMultiply( A, B, C2 );&lt;BR /&gt;
&lt;BR /&gt;
// Make sure that the results are correct allow for some rounding error here&lt;BR /&gt;
printf( "Verifying results..." );&lt;BR /&gt;
for( i = 0 ; i &amp;lt; 16; i++ )&lt;BR /&gt;
if( fabs( c1[i] - c2[i] ) &amp;gt; 1e-20 )&lt;BR /&gt;
printf( "failed at %i,%i: %8.34g %8.34g\n", i/4, i&amp;amp;3, c1[i], c2[i] );&lt;BR /&gt;
&lt;BR /&gt;
printf( "done.\n" );&lt;BR /&gt;
&lt;BR /&gt;
return 0;&lt;BR /&gt;
&lt;BR /&gt;
}&lt;BR /&gt;
&lt;BR /&gt;
void MyMatrixMultiply( vFloat A[4], vFloat B[4], vFloat C[4] )&lt;BR /&gt;
{&lt;BR /&gt;
vFloat A1 = vLoad( A ); //Row 1 of A&lt;BR /&gt;
vFloat A2 = vLoad( A + 1 ); //Row 2 of A&lt;BR /&gt;
vFloat A3 = vLoad( A + 2 ); //Row 3 of A&lt;BR /&gt;
vFloat A4 = vLoad( A + 3); //Row 4 of A&lt;BR /&gt;
&lt;BR /&gt;
vFloat C1 = vZero(); //Row 1 of C, initialized to zero&lt;BR /&gt;
vFloat C2 = vZero(); //Row 2 of C, initialized to zero&lt;BR /&gt;
vFloat C3 = vZero(); //Row 3 of C, initialized to zero&lt;BR /&gt;
vFloat C4 = vZero(); //Row 4 of C, initialized to zero&lt;BR /&gt;
&lt;BR /&gt;
vFloat B1 = vLoad( B ); //Row 1 of B&lt;BR /&gt;
vFloat B2 = vLoad( B + 1 ); //Row 2 of B&lt;BR /&gt;
vFloat B3 = vLoad( B + 2 ); //Row 3 of B&lt;BR /&gt;
vFloat B4 = vLoad( B + 3 ); //Row 4 of B&lt;BR /&gt;
&lt;BR /&gt;
//Multiply the first row of B by the first column of A (do not sum across)&lt;BR /&gt;
C1 = vMADD( vSplat( A1, 0 ), B1, C1 );&lt;BR /&gt;
C2 = vMADD( vSplat( A2, 0 ), B1, C2 );&lt;BR /&gt;
C3 = vMADD( vSplat( A3, 0 ), B1, C3 );&lt;BR /&gt;
C4 = vMADD( vSplat( A4, 0 ), B1, C4 );&lt;BR /&gt;
&lt;BR /&gt;
// Multiply the second row of B by the second column of A and&lt;BR /&gt;
// add to the previous result (do not sum across)&lt;BR /&gt;
C1 = vMADD( vSplat( A1, 1 ), B2, C1 );&lt;BR /&gt;
C2 = vMADD( vSplat( A2, 1 ), B2, C2 );&lt;BR /&gt;
C3 = vMADD( vSplat( A3, 1 ), B2, C3 );&lt;BR /&gt;
C4 = vMADD( vSplat( A4, 1 ), B2, C4 );&lt;BR /&gt;
&lt;BR /&gt;
// Multiply the third row of B by the third column of A and&lt;BR /&gt;
// add to the previous result (do not sum across)&lt;BR /&gt;
C1 = vMADD( vSplat( A1, 2 ), B3, C1 );&lt;BR /&gt;
C2 = vMADD( vSplat( A2, 2 ), B3, C2 );&lt;BR /&gt;
C3 = vMADD( vSplat( A3, 2 ), B3, C3 );&lt;BR /&gt;
C4 = vMADD( vSplat( A4, 2 ), B3, C4 );&lt;BR /&gt;
&lt;BR /&gt;
// Multiply the fourth row of B by the fourth column of A and&lt;BR /&gt;
// add to the previous result (do not sum across)&lt;BR /&gt;
C1 = vMADD( vSplat( A1, 3 ), B4, C1 );&lt;BR /&gt;
C2 = vMADD( vSplat( A2, 3 ), B4, C2 );&lt;BR /&gt;
C3 = vMADD( vSplat( A3, 3 ), B4, C3 );&lt;BR /&gt;
C4 = vMADD( vSplat( A4, 3 ), B4, C4 );&lt;BR /&gt;
&lt;BR /&gt;
// Write out the result to the destination&lt;/P&gt;
&lt;P&gt;vStore( C1, C );&lt;/P&gt;
&lt;P&gt;vStore( C2, C + 1 );&lt;BR /&gt;vStore( C3, C + 2 );&lt;BR /&gt;vStore( C4, C + 3 );&lt;BR /&gt;}&lt;BR /&gt;---&lt;/P&gt;
&lt;P&gt;Compiling with GCC (v4.4) went fine, but at execution I got the errors below -&lt;/P&gt;
&lt;P&gt;--&lt;/P&gt;
&lt;P&gt;time ./matrix-4X4-sse2&lt;BR /&gt;Doing simple matrix multiply...&lt;BR /&gt;Doing vector matrix multiply...&lt;BR /&gt;Verifying results...failed at 0,0: 0.1041077441809730858013338661294256 1.615112779972704616472675241484077e-310&lt;BR /&gt;failed at 0,1: 0.2180626815004354512872453142335871 2.329319807572869703155260521236589e-311&lt;BR /&gt;failed at 0,2: 0.06656423826015861466842693516809959 5.021118224839903602770002086837975e-311&lt;BR /&gt;failed at 0,3: 0.02341829647346145570896425169848953 -1.35546831104864329338482674488192e-311&lt;BR /&gt;failed at 2,0: 1.443534677236949856111745706357672e-314 0.1041077441809730858013338661294256&lt;BR /&gt;failed at 2,1: 1.202089919575080043699319648301355e-314 0.2180626815004354512872453142335871&lt;BR /&gt;failed at 2,2: 1.74640041019361879362255661637412e-314 0.06656423826015861466842693516809959&lt;BR /&gt;failed at 2,3: -6.987567207824738989361744406306725e-315 0.02341829647346145570896425169848953&lt;/P&gt;
&lt;P&gt;done.&lt;/P&gt;
&lt;P&gt;--&lt;/P&gt;
&lt;P&gt;I looked into this with the debugger and found the following:&lt;/P&gt;
&lt;P&gt;While stepping through MyMatrixMultiply() at line 120, the debugger never reached "return (__m128)*(__v4DF)__P"; it jumped directly to " return __extension__ (__m128d){ 0.0, 0.0 }" in the SSE2 header file emmintrin.h.&lt;/P&gt;
&lt;P&gt;The SSE SP FP code posted earlier followed the corresponding path within the xmmintrin.h header file.&lt;/P&gt;
&lt;P&gt;Did I miss anything while converting the earlier SSE SP FP code to today's SSE2 DP FP code?&lt;/P&gt;
&lt;P&gt;Any clue?&lt;/P&gt;
&lt;P&gt;~BR&lt;/P&gt;
&lt;/EM&gt;&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;P&gt;&lt;BR /&gt;I think the following instructions, as used in the code above, have alignment problems:&lt;/P&gt;
&lt;P&gt;#define vLoad( ptr ) _mm_load_pd( (double const *) (ptr)) // Loads two DP FP values (MOVAPD)&lt;BR /&gt;#define vStore( v, ptr ) _mm_store_pd( (double*) (ptr), v ) // Store two DP FP values (MOVAPD)&lt;/P&gt;
&lt;P&gt;Could anyone suggest how to specify "declspec(align(16))", or any other alignment statement, so that the above instructions don't have alignment problems? The SSE SP FP version of this code executed successfully on Linux, but the SSE2 DP FP version fails at execution on Linux.&lt;/P&gt;
&lt;P&gt;Sorry for being naive.&lt;/P&gt;
&lt;P&gt;~BR&lt;/P&gt;</description>
      <pubDate>Sat, 13 Dec 2008 08:49:13 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Help-Vectorization-x87-SSE2/m-p/893183#M2653</guid>
      <dc:creator>srimks</dc:creator>
      <dc:date>2008-12-13T08:49:13Z</dc:date>
    </item>
    <item>
      <title>Re: Help: Vectorization, x87 &amp; SSE2</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Help-Vectorization-x87-SSE2/m-p/893184#M2654</link>
      <description>&lt;DIV style="margin:0px;"&gt;
&lt;DIV id="quote_reply" style="width: 100%; margin-top: 5px;"&gt;
&lt;DIV style="margin-left:2px;margin-right:2px;"&gt;Quoting - &lt;A href="https://community.intel.com/en-us/profile/407152"&gt;srimks&lt;/A&gt;&lt;/DIV&gt;
&lt;DIV style="background-color:#E5E5E5; padding:5px;border: 1px; border-style: inset;margin-left:2px;margin-right:2px;"&gt;&lt;EM&gt;
&lt;DIV style="margin:0px;"&gt;&lt;/DIV&gt;
&lt;P&gt;Could anyone suggest how to specify "declspec(align(16))", or any other alignment statement, so that the above instructions don't have alignment problems? The SSE SP FP version of this code executed successfully on Linux, but the SSE2 DP FP version fails at execution on Linux.&lt;/P&gt;
&lt;/EM&gt;&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;P&gt;Do you mean you're having difficulty finding the manual pages on gcc __attribute__((aligned(16))) ?&lt;/P&gt;
&lt;P&gt;It's in info gcc, and also on the web:&lt;/P&gt;
&lt;P&gt;&lt;A href="http://gcc.gnu.org/onlinedocs/gcc-3.2.3/gcc/Variable-Attributes.html" target="_blank"&gt;http://gcc.gnu.org/onlinedocs/gcc-3.2.3/gcc/Variable-Attributes.html&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;icc should support both __attribute__ and declspec. Both of those will need to support 32 byte alignment, according to the main topic of this forum, but gcc currently supports only the 16-byte subset of AVX.&lt;/P&gt;
&lt;P&gt;We're already past the mid point of the useable lifetime of SSE2 intrinsic coding; it may already be considered "legacy" stuff when AVX hardware arrives. It would be interesting to see to what extent binary translation might be capable of turning SSE2 into AVX over the next 3 years.&lt;/P&gt;</description>
      <pubDate>Sat, 13 Dec 2008 15:01:38 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Help-Vectorization-x87-SSE2/m-p/893184#M2654</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2008-12-13T15:01:38Z</dc:date>
    </item>
    <item>
      <title>Re: Help: Vectorization, x87 &amp; SSE2</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Help-Vectorization-x87-SSE2/m-p/893185#M2655</link>
      <description>&lt;DIV style="margin:0px;"&gt;
&lt;DIV id="quote_reply" style="margin-top: 5px; width: 100%;"&gt;
&lt;DIV style="margin-left:2px;margin-right:2px;"&gt;Quoting - &lt;A href="https://community.intel.com/en-us/profile/367365"&gt;tim18&lt;/A&gt;&lt;/DIV&gt;
&lt;DIV style="background-color:#E5E5E5; padding:5px;border: 1px; border-style: inset;margin-left:2px;margin-right:2px;"&gt;&lt;EM&gt;
&lt;P&gt;Do you mean you're having difficulty finding the manual pages on gcc __attribute__((aligned(16))) ?&lt;/P&gt;
&lt;P&gt;It's in info gcc, and also on the web:&lt;/P&gt;
&lt;P&gt;&lt;A href="http://gcc.gnu.org/onlinedocs/gcc-3.2.3/gcc/Variable-Attributes.html" target="_blank"&gt;http://gcc.gnu.org/onlinedocs/gcc-3.2.3/gcc/Variable-Attributes.html&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;icc should support both __attribute__ and declspec. Both of those will need to support 32 byte alignment, according to the main topic of this forum, but gcc currently supports only the 16-byte subset of AVX.&lt;/P&gt;
&lt;P&gt;We're already past the mid point of the useable lifetime of SSE2 intrinsic coding; it may already be considered "legacy" stuff when AVX hardware arrives. It would be interesting to see to what extent binary translation might be capable of turning SSE2 into AVX over the next 3 years.&lt;/P&gt;
&lt;/EM&gt;&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;P&gt;No Tim.&lt;/P&gt;
&lt;P&gt;I simply mean that the above code, written for SSE2 DP FP, has alignment problems. It has no problem calling into GCC's emmintrin.h file; that part is perfectly fine, as I found out while debugging.&lt;/P&gt;
&lt;P&gt;I think that if I take care of the alignment of the arguments (ptr, v) passed to the _mm_load_pd() &amp;amp; _mm_store_pd() instructions, it should be fine. My question is how to specify the alignment for the arguments passed to these instructions.&lt;/P&gt;
&lt;P&gt;Thanks for your input though.&lt;/P&gt;
&lt;P&gt;~BR&lt;/P&gt;</description>
      <pubDate>Sat, 13 Dec 2008 15:13:51 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Help-Vectorization-x87-SSE2/m-p/893185#M2655</guid>
      <dc:creator>srimks</dc:creator>
      <dc:date>2008-12-13T15:13:51Z</dc:date>
    </item>
    <item>
      <title>Re: Help: Vectorization, x87 &amp; SSE2</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Help-Vectorization-x87-SSE2/m-p/893186#M2656</link>
      <description>&lt;DIV style="margin:0px;"&gt;
&lt;DIV id="quote_reply" style="width: 100%; margin-top: 5px;"&gt;
&lt;DIV style="margin-left:2px;margin-right:2px;"&gt;Quoting - &lt;A href="https://community.intel.com/en-us/profile/407152"&gt;srimks&lt;/A&gt;&lt;/DIV&gt;
&lt;DIV style="background-color:#E5E5E5; padding:5px;border: 1px; border-style: inset;margin-left:2px;margin-right:2px;"&gt;&lt;EM&gt;
&lt;P&gt;&lt;BR /&gt;I think the following instructions, as used in the code above, have alignment problems:&lt;/P&gt;
&lt;P&gt;#define vLoad( ptr )         _mm_load_pd( (double const *) (ptr))  // Loads two DP FP values (MOVAPD)&lt;BR /&gt;#define vStore( v, ptr )     _mm_store_pd( (double*) (ptr), v )    // Store two DP FP values (MOVAPD)&lt;/P&gt;
&lt;P&gt;Could anyone suggest how to specify "declspec(align(16))", or any other alignment statement, so that the above instructions don't have alignment problems? The SSE SP FP version of this code executed successfully on Linux, but the SSE2 DP FP version fails at execution on Linux.&lt;/P&gt;
&lt;P&gt;Sorry for being naive.&lt;/P&gt;
&lt;P&gt;~BR&lt;/P&gt;
&lt;/EM&gt;&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;P&gt;Hi All.&lt;/P&gt;
&lt;P&gt;With the Intel C++ Compiler (v10.0.23), compilation succeeded, but at execution I get a segmentation fault as below -&lt;/P&gt;
&lt;P&gt;------&lt;/P&gt;
&lt;P&gt;Doing simple matrix multiply...&lt;BR /&gt;Doing vector matrix multiply...&lt;BR /&gt;Verifying results...failed at 0,0: -0.03592963096335939632286482492418145 -0.1215526035477607763590768286121602&lt;BR /&gt;failed at 0,1: 0.1015256338444925493513792957855912 0.1927740475185450996775671228533611&lt;BR /&gt;failed at 0,2: -0.102889309227864655937878524127882 0.006359662740173750716810019412150723&lt;BR /&gt;failed at 0,3: 0.09124841367405253644840001925331308 0.2332680556313918018851438773708651&lt;BR /&gt;failed at 2,0: 0.01115496609811927712641033139107094 -0.03592963096335939632286482492418145&lt;BR /&gt;failed at 2,1: -0.03152036281085966729076375258955522 0.1015256338444925493513792957855912&lt;BR /&gt;failed at 2,2: 0.03194373906779557348301068486762233 -0.102889309227864655937878524127882&lt;BR /&gt;failed at 2,3: -0.0283296247066727180374812178342836 0.09124841367405253644840001925331308&lt;BR /&gt;done.&lt;BR /&gt;Segmentation fault&lt;/P&gt;
&lt;P&gt;---&lt;/P&gt;
&lt;P&gt;Since I get a "SEGMENTATION FAULT" with the Intel Compiler (v10.0.23), I take it as confirmed that the SSE2 DP FP code has an alignment problem. The GCC (v4.4) build showed the same verification failures but did not give a "SEGMENTATION FAULT", and with it I was able to check the calls into emmintrin.h.&lt;/P&gt;
&lt;P&gt;Moreover, when I debug the SSE2 DP FP code with the Intel Debugger (IDB) and do "info registers", I get -&lt;/P&gt;
&lt;P&gt;-------&lt;/P&gt;
&lt;P&gt;(idb) info registers&lt;BR /&gt;$rax           0x7fbffff060     548682068064&lt;BR /&gt;$rdx           0x7fbffff120     548682068256&lt;BR /&gt;$rcx           0x7fbffff120     548682068256&lt;BR /&gt;$rbx           0x0      0&lt;BR /&gt;$rsi           0x7fbffff0a0     548682068128&lt;BR /&gt;$rdi           0x7fbffff060     548682068064&lt;BR /&gt;$rbp [$fp]     0x7fbffff050     (void *) 0x7fbffff050&lt;BR /&gt;$rsp [$sp]     0x7fbfffed70     (void *) 0x7fbfffed70&lt;BR /&gt;$r8            0x2a9557ae20     182894177824&lt;BR /&gt;$r9            0x0      0&lt;BR /&gt;$r10           0x22     34&lt;BR /&gt;$r11           0x246    582&lt;BR /&gt;$r12           0x400f50 4198224&lt;BR /&gt;$r13           0x7fbffff290     548682068624&lt;BR /&gt;$r14           0x0      0&lt;BR /&gt;$r15           0x0      0&lt;BR /&gt;$orig_rax      0xffffffffffffffff       -1&lt;BR /&gt;$xmm0          0x0      {v4_float = {0, 0, 0, 0}, v2_double = {0, 0}, v16_int8 = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, v8_int16 = {0, 0, 0, 0, 0, 0, 0, 0}, v4_int32 = {0, 0, 0, 0}, v2_int64 = {0, 0}}&lt;BR /&gt;$xmm1          0x0      {v4_float = {0, 0, 0, 0}, v2_double = {0, 0}, v16_int8 = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, v8_int16 = {0, 0, 0, 0, 0, 0, 0, 0}, v4_int32 = {0, 0, 0, 0}, v2_int64 = {0, 0}}&lt;BR /&gt;$xmm2          0x0      {v4_float = {0, 0, 0, 0}, v2_double = {0, 0}, v16_int8 = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, v8_int16 = {0, 0, 0, 0, 0, 0, 0, 0}, v4_int32 = {0, 0, 0, 0}, v2_int64 = {0, 0}}&lt;BR /&gt;$xmm3          0x0      {v4_float = {0, 0, 0, 0}, v2_double = {0, 0}, v16_int8 = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, v8_int16 = {0, 0, 0, 0, 0, 0, 0, 0}, v4_int32 = {0, 0, 0, 0}, v2_int64 = {0, 0}}&lt;BR /&gt;$xmm4          0x0      {v4_float = {0, 0, 0, 0}, v2_double = {0, 0}, v16_int8 = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, v8_int16 = {0, 0, 0, 0, 0, 0, 0, 0}, v4_int32 = {0, 0, 0, 0}, v2_int64 = {0, 0}}&lt;BR 
/&gt;$xmm5          0x0      {v4_float = {0, 0, 0, 0}, v2_double = {0, 0}, v16_int8 = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, v8_int16 = {0, 0, 0, 0, 0, 0, 0, 0}, v4_int32 = {0, 0, 0, 0}, v2_int64 = {0, 0}}&lt;BR /&gt;$xmm6          0x0      {v4_float = {0, 0, 0, 0}, v2_double = {0, 0}, v16_int8 = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, v8_int16 = {0, 0, 0, 0, 0, 0, 0, 0}, v4_int32 = {0, 0, 0, 0}, v2_int64 = {0, 0}}&lt;BR /&gt;$xmm7          0x0      {v4_float = {0, 0, 0, 0}, v2_double = {0, 0}, v16_int8 = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, v8_int16 = {0, 0, 0, 0, 0, 0, 0, 0}, v4_int32 = {0, 0, 0, 0}, v2_int64 = {0, 0}}&lt;BR /&gt;$xmm8          0x0      {v4_float = {0, 0, 0, 0}, v2_double = {0, 0}, v16_int8 = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, v8_int16 = {0, 0, 0, 0, 0, 0, 0, 0}, v4_int32 = {0, 0, 0, 0}, v2_int64 = {0, 0}}&lt;BR /&gt;$xmm9          0x0      {v4_float = {0, 0, 0, 0}, v2_double = {0, 0}, v16_int8 = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, v8_int16 = {0, 0, 0, 0, 0, 0, 0, 0}, v4_int32 = {0, 0, 0, 0}, v2_int64 = {0, 0}}&lt;BR /&gt;$xmm10         0x0      {v4_float = {0, 0, 0, 0}, v2_double = {0, 0}, v16_int8 = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, v8_int16 = {0, 0, 0, 0, 0, 0, 0, 0}, v4_int32 = {0, 0, 0, 0}, v2_int64 = {0, 0}}&lt;BR /&gt;$xmm11         0x0      {v4_float = {0, 0, 0, 0}, v2_double = {0, 0}, v16_int8 = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, v8_int16 = {0, 0, 0, 0, 0, 0, 0, 0}, v4_int32 = {0, 0, 0, 0}, v2_int64 = {0, 0}}&lt;BR /&gt;$xmm12         0x0      {v4_float = {0, 0, 0, 0}, v2_double = {0, 0}, v16_int8 = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, v8_int16 = {0, 0, 0, 0, 0, 0, 0, 0}, v4_int32 = {0, 0, 0, 0}, v2_int64 = {0, 0}}&lt;BR /&gt;$xmm13         0x0      {v4_float = {0, 0, 0, 0}, v2_double = {0, 0}, v16_int8 = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, v8_int16 = {0, 0, 0, 0, 0, 0, 0, 0}, v4_int32 = {0, 0, 0, 0}, v2_int64 = {0, 
0}}&lt;BR /&gt;$xmm14         0x0      {v4_float = {0, 0, 0, 0}, v2_double = {0, 0}, v16_int8 = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, v8_int16 = {0, 0, 0, 0, 0, 0, 0, 0}, v4_int32 = {0, 0, 0, 0}, v2_int64 = {0, 0}}&lt;BR /&gt;$xmm15         0x0      {v4_float = {0, 0, 0, 0}, v2_double = {0, 0}, v16_int8 = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, v8_int16 = {0, 0, 0, 0, 0, 0, 0, 0}, v4_int32 = {0, 0, 0, 0}, v2_int64 = {0, 0}}&lt;BR /&gt;$st0           0x0      (unprintable extended double precision float)&lt;BR /&gt;$st1           0x0      (unprintable extended double precision float)&lt;BR /&gt;$st2           0x0      (unprintable extended double precision float)&lt;BR /&gt;$st3           0x0      (unprintable extended double precision float)&lt;BR /&gt;$st4           0x0      (unprintable extended double precision float)&lt;BR /&gt;$st5           0x0      (unprintable extended double precision float)&lt;BR /&gt;$st6           0x3be3c9b52e0000000000   3.2655560360608236672811378540213e-317&lt;BR /&gt;$st7           0x3be3c9b52e0000000000   3.2655560360608236672811378540213e-317&lt;BR /&gt;$rip [$pc]     0x4008ae (void *) 0x4008ae&lt;BR /&gt;$rflags        0x202    514&lt;BR /&gt;-----&lt;/P&gt;
&lt;P&gt;Similarly, when I compile with GCC (v4.4), run under GNU GDB and do "info registers", I get the following -&lt;/P&gt;
&lt;P&gt;---&lt;/P&gt;
&lt;P&gt;(gdb) info registers&lt;BR /&gt;rax            0x7fbffff210     548682068496&lt;BR /&gt;rbx            0x7fbffff1b8     548682068408&lt;BR /&gt;rcx            0x7fbffff1a0     548682068384&lt;BR /&gt;rdx            0x7fbffff120     548682068256&lt;BR /&gt;rsi            0x7fbffff1a0     548682068384&lt;BR /&gt;rdi            0x7fbffff1e0     548682068448&lt;BR /&gt;rbp            0x7fbffff110     0x7fbffff110&lt;BR /&gt;rsp            0x7fbfffef30     0x7fbfffef30&lt;BR /&gt;r8             0x1      1&lt;BR /&gt;r9             0x0      0&lt;BR /&gt;r10            0x22     34&lt;BR /&gt;r11            0x246    582&lt;BR /&gt;r12            0x401410 4199440&lt;BR /&gt;r13            0x7fbffff340     548682068800&lt;BR /&gt;r14            0x0      0&lt;BR /&gt;r15            0x0      0&lt;BR /&gt;rip            0x400921 0x400921 &amp;lt;MyMatrixMultiply&amp;gt;&lt;BR /&gt;eflags         0x302    770&lt;BR /&gt;cs             0x33     51&lt;BR /&gt;ss             0x2b     43&lt;BR /&gt;ds             0x0      0&lt;BR /&gt;es             0x0      0&lt;BR /&gt;fs             0x0      0&lt;BR /&gt;gs             0x0      0&lt;/P&gt;
&lt;P&gt;---&lt;/P&gt;
&lt;P&gt;Queries:&lt;/P&gt;
&lt;P&gt;(a) How do I get rid of the ALIGNMENT problems in the SSE2 DP FP code?&lt;/P&gt;
&lt;P&gt;(b) Why do the contents of the rax &amp;amp; rcx registers differ between GNU GDB &amp;amp; Intel IDB?&lt;/P&gt;
&lt;P&gt;(c) The rbx register is 0 under IDB, while GDB shows 548682068408. Why is rbx empty under IDB?&lt;/P&gt;
&lt;P&gt;(d) The rdx register contents shown by IDB &amp;amp; GDB are the same. Why is that?&lt;/P&gt;
&lt;P&gt;Note: The above operations were done on two separate consoles, one set up with the environment for the GCC (v4.4) compiler and the other with the environment for the Intel C++ Compiler (v10.0.23). The SSE2 DP FP C code is the one shown in Reply#5 above.&lt;/P&gt;
&lt;P&gt;~BR&lt;/P&gt;</description>
      <pubDate>Mon, 15 Dec 2008 07:13:15 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Help-Vectorization-x87-SSE2/m-p/893186#M2656</guid>
      <dc:creator>srimks</dc:creator>
      <dc:date>2008-12-15T07:13:15Z</dc:date>
    </item>
    <item>
      <title>Re: Help: Vectorization, x87 &amp; SSE2</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Help-Vectorization-x87-SSE2/m-p/893187#M2657</link>
      <description>&lt;DIV style="margin:0px;"&gt;&lt;/DIV&gt;
&lt;P&gt;For static array declarations, "__declspec(align(16))" should enforce 16-byte alignment. For dynamic allocation, you can use _mm_malloc() and _mm_free(). I believe these are in 'xmmintrin.h'. You could also do your own brute-force alignment by allocating 15 extra bytes and re-assigning the first 16-byte-aligned address to a new pointer.&lt;/P&gt;
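&lt;P&gt;As a rough sketch of the three techniques above, assuming GCC or icc on Linux: the static case uses the GCC spelling __attribute__((aligned(16))) rather than __declspec, and the helper names is_aligned16, matrix_static, alloc_matrix and alloc_matrix_manual are made up for illustration. The headers are included in quoted form only because the forum software tends to swallow angle-bracket header names.&lt;/P&gt;
&lt;PRE&gt;
```c
/* Quoted include form is used because the forum strips angle-bracket header names. */
#include "stdint.h"
#include "stdlib.h"
#include "emmintrin.h"   /* SSE2 intrinsics; with GCC this also brings in _mm_malloc/_mm_free */

/* Hypothetical helper: returns nonzero when p sits on a 16-byte boundary. */
static int is_aligned16(const void *p)
{
    return (uintptr_t) p % 16 == 0;
}

/* Technique 1: static storage with an explicit alignment attribute (GCC spelling). */
static double matrix_static[16] __attribute__((aligned(16)));

/* Technique 2: dynamic storage from _mm_malloc; release it with _mm_free, not free(). */
static double *alloc_matrix(void)
{
    return (double *) _mm_malloc(16 * sizeof(double), 16);
}

/* Technique 3: brute force. Over-allocate by 15 bytes and round the address up
   to the next multiple of 16. *raw receives the pointer that must go to free(). */
static double *alloc_matrix_manual(void **raw)
{
    *raw = malloc(16 * sizeof(double) + 15);
    if (*raw == NULL)
        return NULL;
    return (double *) (((uintptr_t) *raw + 15) / 16 * 16);
}
```
&lt;/PRE&gt;
&lt;P&gt;Any pointer for which is_aligned16() returns nonzero should be safe to hand to _mm_load_pd() and _mm_store_pd(). With the brute-force variant, keep the raw pointer around, since that is the one free() expects.&lt;/P&gt;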
&lt;P&gt;I noticed your two register dumps have different instruction pointer ($rip) values so maybe you are breaking at two different points in the program? Anyway you just need to isolate the instruction that caused the seg fault and see if it is an issue with 16-byte alignment. If it is, use one of the above alignment techniques and see if that resolves it.&lt;/P&gt;
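&lt;P&gt;While isolating the fault, it may also be worth checking the data layout, separately from alignment. An __m128d holds two doubles rather than four, so "vFloat A[4]" provides storage for only 8 doubles, while the scalar loops index 16 elements. The posted output, where the c2 values at 2,0 to 2,3 repeat the c1 values from 0,0 to 0,3, looks consistent with the two arrays overlapping when indexed that far. Below is a minimal sketch of a double-precision 4x4 multiply that keeps each row in two registers. It is not the original routine: matmul4x4_pd is a hypothetical name, the matrices are assumed to be row-major arrays of 16 doubles on 16-byte boundaries, and the broadcast is done with _mm_set1_pd instead of the vSplat shuffle.&lt;/P&gt;
&lt;PRE&gt;
```c
/* Quoted include form is used because the forum strips angle-bracket header names. */
#include "emmintrin.h"   /* SSE2 intrinsics */

/* Hypothetical sketch: C = A * B for row-major 4x4 matrices of double.
   Assumes A, B and C each point to 16 doubles on a 16-byte boundary.
   Each row of four doubles is held in two __m128d registers. */
static void matmul4x4_pd(const double *A, const double *B, double *C)
{
    int i, k;

    for (i = 0; i != 4; i++) {
        __m128d lo = _mm_setzero_pd();   /* accumulates C[i][0] and C[i][1] */
        __m128d hi = _mm_setzero_pd();   /* accumulates C[i][2] and C[i][3] */

        for (k = 0; k != 4; k++) {
            /* Broadcast the scalar A[i][k] into both lanes. */
            __m128d a = _mm_set1_pd(A[i * 4 + k]);

            /* Row k of B, split into its low and high pair of doubles. */
            lo = _mm_add_pd(lo, _mm_mul_pd(a, _mm_load_pd(B + k * 4)));
            hi = _mm_add_pd(hi, _mm_mul_pd(a, _mm_load_pd(B + k * 4 + 2)));
        }

        _mm_store_pd(C + i * 4, lo);
        _mm_store_pd(C + i * 4 + 2, hi);
    }
}
```
&lt;/PRE&gt;
&lt;P&gt;With this layout, every load and store address is a multiple of two doubles from the start of the array, so aligning the three arrays themselves is enough for the aligned MOVAPD forms.&lt;/P&gt;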
&lt;P&gt;Mike&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 16 Dec 2008 21:50:17 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Help-Vectorization-x87-SSE2/m-p/893187#M2657</guid>
      <dc:creator>Michael_S_Intel8</dc:creator>
      <dc:date>2008-12-16T21:50:17Z</dc:date>
    </item>
  </channel>
</rss>

