Intel® ISA Extensions

Help: Vectorization, x87 & SSE2

srimks
New Contributor II

Hi All.

I am new to SIMD programming and have a piece of code as below -

--

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#include <xmmintrin.h>

// Set up a vector type for a float[4] array for each vector type
typedef __m128 vFloat;
// Also define some macros to map a virtual SIMD language to
// each actual SIMD language.

// Note that because i MUST be an immediate, it is incorrect here
// to alias i to a stack-based copy and replicate that 4 times.
#define vSplat( v, i ) ({ __m128 a = v; a = _mm_shuffle_ps( a, a, _MM_SHUFFLE(i,i,i,i) ); a; })

inline __m128 vMADD( __m128 a, __m128 b, __m128 c )
{
return _mm_add_ps( c, _mm_mul_ps( a, b ) );
}

#define vLoad( ptr ) _mm_load_ps( (float*) (ptr) )
#define vStore( v, ptr ) _mm_store_ps( (float*) (ptr), v )
#define vZero() _mm_setzero_ps()

#define empty() _m_empty(); //clears the SSE2 registers and SSE2 state.

// Prototype for a vector matrix multiply function
void MyMatrixMultiply( vFloat A[4], vFloat B[4], vFloat C[4] );

int main( void )
{
// The vFloat type (defined previously) is a vector or scalar array that contains 4 floats
// Thus each one of these is a 4x4 matrix, stored in the C storage order.
vFloat A[4];
vFloat B[4];
vFloat C1[4];
vFloat C2[4];
int i, j, k;

// Pointers to the elements in A, B, C1 and C2
float *a = (float*) &A;
float *b = (float*) &B;
float *c1 = (float*) &C1;
float *c2 = (float*) &C2;

// Initialize the data
for( i = 0; i < 16; i++ )
{
a[i] = (double) (rand() - RAND_MAX/2) / (double) (RAND_MAX );
b[i] = (double) (rand() - RAND_MAX/2) / (double) (RAND_MAX );
c1[i] = c2[i] = 0.0;
}

// Perform the brute-force version of matrix multiplication and use this later to check for correctness
printf( "Doing simple matrix multiply...\n" );
for( i = 0; i < 4; i++ )
for( j = 0; j < 4; j++ )
{
float result = 0.0f;
for( k = 0; k < 4; k++ )
result += a[ i * 4 + k ] * b[ k * 4 + j ];
c1[ i * 4 + j ] = result;
}

// The vector version
printf( "Doing vector matrix multiply...\n" );
MyMatrixMultiply( A, B, C2 );

// Make sure that the results are correct
// Allow for some rounding error here
printf( "Verifying results..." );
for( i = 0 ; i < 16; i++ )
if( fabs( c1[i] - c2[i] ) > 1e-6 )
printf( "failed at %i,%i: %8.17g %8.17g\n", i/4, i&3, c1[i], c2[i] );

printf( "done.\n" );

return 0;
}

void MyMatrixMultiply( vFloat A[4], vFloat B[4], vFloat C[4] )
{

vFloat A1 = vLoad( A ); //Row 1 of A
vFloat A2 = vLoad( A + 1 ); //Row 2 of A
vFloat A3 = vLoad( A + 2 ); //Row 3 of A
vFloat A4 = vLoad( A + 3); //Row 4 of A
vFloat C1 = vZero(); //Row 1 of C, initialized to zero
vFloat C2 = vZero(); //Row 2 of C, initialized to zero
vFloat C3 = vZero(); //Row 3 of C, initialized to zero
vFloat C4 = vZero(); //Row 4 of C, initialized to zero
vFloat B1 = vLoad( B ); //Row 1 of B
vFloat B2 = vLoad( B + 1 ); //Row 2 of B
vFloat B3 = vLoad( B + 2 ); //Row 3 of B
vFloat B4 = vLoad( B + 3); //Row 4 of B
//Multiply the first row of B by the first column of A (do not sum across)
C1 = vMADD( vSplat( A1, 0 ), B1, C1 );
C2 = vMADD( vSplat( A2, 0 ), B1, C2 );
C3 = vMADD( vSplat( A3, 0 ), B1, C3 );
C4 = vMADD( vSplat( A4, 0 ), B1, C4 );
// Multiply the second row of B by the second column of A and
// add to the previous result (do not sum across)
C1 = vMADD( vSplat( A1, 1 ), B2, C1 );
C2 = vMADD( vSplat( A2, 1 ), B2, C2 );
C3 = vMADD( vSplat( A3, 1 ), B2, C3 );
C4 = vMADD( vSplat( A4, 1 ), B2, C4 );
// Multiply the third row of B by the third column of A and
// add to the previous result (do not sum across)
C1 = vMADD( vSplat( A1, 2 ), B3, C1 );
C2 = vMADD( vSplat( A2, 2 ), B3, C2 );
C3 = vMADD( vSplat( A3, 2 ), B3, C3 );
C4 = vMADD( vSplat( A4, 2 ), B3, C4 );
// Multiply the fourth row of B by the fourth column of A and
// add to the previous result (do not sum across)
C1 = vMADD( vSplat( A1, 3 ), B4, C1 );
C2 = vMADD( vSplat( A2, 3 ), B4, C2 );
C3 = vMADD( vSplat( A3, 3 ), B4, C3 );
C4 = vMADD( vSplat( A4, 3 ), B4, C4 );
// Write out the result to the destination
vStore( C1, C );
vStore( C2, C + 1 );
vStore( C3, C + 2 );
vStore( C4, C + 3 );

empty(); //clears the SSE2 registers and SSE2 state.

}

------

I have the following queries; they may be simple for others:

(a) What does the complete content of L#14 mean?
-----
#define vSplat( v, i ) ({ __m128 a = v; a = _mm_shuffle_ps( a, a, _MM_SHUFFLE(i,i,i,i) ); a; })
-----
Could it be elaborated as the sequence mentioned below for the above macro?
"__m128 a = v;
a = _mm_shuffle_ps( a, a, _MM_SHUFFLE(i,i,i,i) );
a;"

(b) At L#25, I defined "_m_empty()" for SSE2, which is called in the MyMatrixMultiply() API at the end; can its use at the end improve performance, or is it simply not needed for SSE2?

(c) Within this code, can I perform -
(i) vectorization with compiler flags (Intel/GNU) for the FOR loops at L#56 - 63?
(ii) This code has been written to target SSE2; could it be checked for x87 FP register operations too, and if yes, how?
(iii) Can a performance comparison be generated for both x87 & SSE2 FP registers, and if yes, how can it be performed & analyzed?

I appreciate your responses to the above.

I do refer to the Intel C++ Intrinsics Reference & Intel C++ Compiler documents.

~BR

10 Replies
Michael_S_Intel8
Employee
Hello,
a) The vSplat macro is an abstraction for the SSE SHUFPS instruction. Specifically it will perform a broadcast of element 'i' to every field in the XMM register, where 'i' is a value from 0-3. I think the way it is coded will leave the original value 'v' untouched and return the broadcast result 'a' in a different register. I am not sure what "a;" will do outside the macro. I think it must be a way of returning that value from the macro (see the first sketch below).
b) _m_empty() is an intrinsic for the EMMS instruction which clears the MMX state to avoid aliasing between the MMX registers and the x87 FP stack. It isn't needed for SSE routines. We recommend that any existing MMX code should be ported to SSE2 for best efficiency (see the second sketch below for where EMMS does matter).
c)
i) The matrix multiply loops might vectorize, especially if you add "#pragma vector always" and work with the vectorization report. I'm thinking the vectorizer might have a hard time realizing the block-unroll-jam transformation that makes your intrinsics code work efficiently.
ii) I am not sure what you mean by checking for x87 code? If you compile this as a 32-bit app with no instruction set target (QxK, QxW, QxP, etc.) then the compiler will generate x87 FP instructions for the loops.
iii) It seems like you want to compare results computed with x87 and SSE instructions. Compiling with no instruction set target should give you this comparison. If you compile as 64-bit or use /QxK, /QxW, etc. then the loops will probably be coded with SSE scalar or packed instructions.
Note, the Intel Math Kernel Library contains fast matrix multiply routines that have been tuned to the metal for the latest Intel CPU's.
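
For illustration, here is a minimal standalone sketch of point (a), showing what vSplat(v, 0) evaluates to. It assumes the GNU statement-expression extension (gcc/icc) and an SSE-capable x86 target; the input values are arbitrary.

---

#include <stdio.h>
#include <xmmintrin.h>

#define vSplat( v, i ) ({ __m128 a = v; a = _mm_shuffle_ps( a, a, _MM_SHUFFLE(i,i,i,i) ); a; })

int main( void )
{
    float in[4] __attribute__(( aligned(16) )) = { 1.0f, 2.0f, 3.0f, 4.0f };
    float out[4] __attribute__(( aligned(16) ));

    __m128 v = _mm_load_ps( in );
    __m128 b = vSplat( v, 0 );    /* SHUFPS broadcasts element 0 into all four lanes */
    _mm_store_ps( out, b );

    printf( "%g %g %g %g\n", out[0], out[1], out[2], out[3] );  /* expected: 1 1 1 1 */
    return 0;
}

---

In GNU C the value of a ({ ... }) statement expression is the value of its last expression, so the trailing "a;" is what the macro hands back to the caller.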
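
And a minimal sketch of point (b), showing where EMMS actually matters. The mixed MMX/SSE code is purely illustrative (MMX code is exactly what should be ported away, as noted above); it assumes mmintrin.h/xmmintrin.h and MMX support enabled in the compiler.

---

#include <mmintrin.h>   /* MMX intrinsics (__m64) */
#include <xmmintrin.h>  /* SSE intrinsics (__m128) */

float mixed_example( void )
{
    /* MMX: operates on the MM registers, which alias the x87 FP stack */
    __m64 m = _mm_add_pi16( _mm_set1_pi16( 1 ), _mm_set1_pi16( 2 ) );
    (void) m;
    _mm_empty();   /* EMMS (same instruction as _m_empty in the code above):
                      required before any x87 floating-point math below */

    /* SSE: operates on the XMM registers, so no EMMS is needed for this part */
    float out[4] __attribute__(( aligned(16) ));
    _mm_store_ps( out, _mm_set1_ps( 1.0f ) );

    return out[0] * 2.0f;   /* may be compiled to x87 code in a 32-bit build */
}

---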
Regards,
Mike Stoner
Software Engineer
Intel SSG - Application Performance
TimP
Honored Contributor III

If you want to compare x87 and SSE2 results, you should use Linux and plain Fortran or C source code. Intel compilers do perform automatic unroll-and-jam optimization for matrix multiply with options -xP -O3. You should be able to drop back to full precision x87 code with -mp -pc80 -long-double. Recent gnu compilers will perform SSE2 vectorization with options -O3 -mfpmath=sse -ffast-math, and x87 code with -O3 -mfpmath=387.

x87 on Windows is generally available only in 32-bit compilers, and the default for Microsoft compatibility is to set x87 53-bit precision mode. I don't see any value in an x87 comparison, unless you use 64-bit precision accumulation.

You should be able to approach optimum performance without resorting to low level coding such as you suggested. In fact, it may be quite difficult to arrive at adequate base performance and correctness if you start with intrinsics. Optimization with threading is much more difficult, but still should be possible with C or Fortran source with OpenMP. As Mike suggested, you should consider MKL as a reference for professionally optimized code, both threaded and not. It has the advantage of being a plug-in replacement for the netlib BLAS source code.
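
For reference, a sketch of the kind of plain C source meant here, which the compilers can auto-vectorize with the options above; the function name and the C99 restrict qualifiers are illustrative assumptions, not taken from the code in this thread.

---

/* Plain C 4x4 single-precision matrix multiply, left to the auto-vectorizer
   (e.g. icc -xP -O3, or gcc -O3 -mfpmath=sse -ffast-math). */
void matmul4x4( const float * restrict a, const float * restrict b, float * restrict c )
{
    int i, j, k;
    for( i = 0; i < 4; i++ )
        for( j = 0; j < 4; j++ )
        {
            float sum = 0.0f;
            for( k = 0; k < 4; k++ )
                sum += a[ i * 4 + k ] * b[ k * 4 + j ];
            c[ i * 4 + j ] = sum;
        }
}

---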

srimks
New Contributor II
Hello,
a) The vSplat macro is an abstraction for the SSE SHUFPS instruction. Specifically it will perform a broadcast of element 'i' to every field in the XMM register, where 'i' is a value from 0-3. I think the way it is coded will leave the original value 'v' untouched and return the broadcast result 'a' in a different register. I am not sure what "a;" will do outside the macro. I think it must be a way of returning that value from the macro.
b) _m_empty() is an intrinsic for the EMMS instruction which clears the MMX state to avoid aliasing between the MMX registers and the x87 FP stack. It isn't needed for SSE routines. We recommend that any existing MMX code should be ported to SSE2 for best efficiency.
c)
i) The matrix multiply loops might vectorize, especially if you add "#pragma vector always" and work with the vectorization report. I'm thinking the vectorizer might have a hard time realizing the block-unroll-jam transformation that makes your intrinsics code work efficiently.
ii) I am not sure what you mean by checking for x87 code? If you compile this as a 32-bit app with no instruction set target (QxK, QxW, QxP, etc.) then the compiler will generate x87 FP instructions for the loops.
iii) It seems like you want to compare results computed with x87 and SSE instructions. Compiling with no instruction set target should give you this comparison. If you compile as 64-bit or use /QxK, /QxW, etc. then the loops will probably be coded with SSE scalar or packed instructions.
Note, the Intel Math Kernel Library contains fast matrix multiply routines that have been tuned to the metal for the latest Intel CPU's.
Regards,
Mike Stoner
Software Engineer
Intel SSG - Application Performance

Thanks, Michael, for your valuable inputs; I will look into these now.

Could you also try answering this thread http://software.intel.com/en-us/forums/showthread.php?t=62183, if possible?

~BR

srimks
New Contributor II
Quoting - tim18

If you want to compare x87 and SSE2 results, you should use Linux and plain Fortran or C source code. Intel compilers do perform automatic unroll-and-jam optimization for matrix multiply with options -xP -O3. You should be able to drop back to full precision x87 code with -mp -pc80 -long-double. Recent gnu compilers will perform SSE2 vectorization with options -O3 -mfpmath=sse -ffast-math, and x87 code with -O3 -mfpmath=387.

x87 on Windows is generally available only in 32-bit compilers, and the default for Microsoft compatibility is to set x87 53-bit precision mode. I don't see any value in an x87 comparison, unless you use 64-bit precision accumulation.

You should be able to approach optimum performance without resorting to low level coding such as you suggested. In fact, it may be quite difficult to arrive at adequate base performance and correctness if you start with intrinsics. Optimization with threading is much more difficult, but still should be possible with C or Fortran source with OpenMP. As Mike suggested, you should consider MKL as a reference for professionally optimized code, both threaded and not. It has the advantage of being a plug-in replacement for the netlib BLAS source code.

Thanks, Tim, for your valuable inputs; I will look into these now.

~BR

srimks
New Contributor II
Hi Michael.

I modified the above code for SSE2 DP FP analysis, as below -
---

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#include <emmintrin.h>

// Set up a vector type for a double[2] array for each vector type
typedef __m128d vFloat;

// Note that i is a mask which is an immediate
// Selects two specific DP FP values from a and b, based on the mask i.
// The mask must be an immediate. (SHUFPD)
#define vSplat( v, i ) ({ __m128d a = v; a = _mm_shuffle_pd( a, a, _MM_SHUFFLE(i,i,i,i) ); a; })

inline __m128d vMADD(__m128d a, __m128d b, __m128d c)
{
return _mm_add_pd( c, _mm_mul_pd( a, b ) );
}

#define vLoad( ptr ) _mm_load_pd( (double const *) (ptr)) // Loads two DP FP values (MOVAPD)
#define vStore( v, ptr ) _mm_store_pd( (double*) (ptr), v ) // Store two DP FP values (MOVAPD)
#define vZero() _mm_setzero_pd() // Sets two DP FP values to zero (XORPD)


// Prototype for a vector matrix multiply function
void MyMatrixMultiply( vFloat A[4], vFloat B[4], vFloat C[4] );

int main( void )
{
// The vFloat type (defined previously) is a vector array that contains 2 double
// Thus each one of these is a 4x4 matrix, stored in the C storage order.
vFloat A[4];
vFloat B[4];
vFloat C1[4];
vFloat C2[4];
int i, j, k;

// Pointers to the elements in A, B, C1 and C2
double *a = (double*) &A;
double *b = (double*) &B;
double *c1 = (double*) &C1;
double *c2 = (double*) &C2;

// Initialize the data
for( i = 0; i < 4; i++ )
{
a[i] = (double) (rand() - RAND_MAX/2) / (double) (RAND_MAX );
b[i] = (double) (rand() - RAND_MAX/2) / (double) (RAND_MAX );
c1[i] = c2[i] = 0.0;
}
// Perform matrix multiplication and use this later to check for correctness
printf( "Doing simple matrix multiply...n" );
for( i = 0; i < 4; i++ )
for( j = 0; j < 4; j++ )
{
double result = 0.0f;
for( k = 0; k < 4; k++ )
result += a[ i * 4 + k] * b[ k * 4 + j ];
c1[ i * 4 + j ] = result;
}

// The vector version
printf( "Doing vector matrix multiply...n" );
MyMatrixMultiply( A, B, C2 );

// Make sure that the results are correct allow for some rounding error here
printf( "Verifying results..." );
for( i = 0 ; i < 16; i++ )
if( fabs( c1[i] - c2[i] ) > 1e-20 )
printf( "failed at %i,%i: %8.34g %8.34g\n", i/4, i&3, c1[i], c2[i] );

printf( "done.n" );

return 0;

}

void MyMatrixMultiply( vFloat A[4], vFloat B[4], vFloat C[4] )
{
vFloat A1 = vLoad( A ); //Row 1 of A
vFloat A2 = vLoad( A + 1 ); //Row 2 of A
vFloat A3 = vLoad( A + 2 ); //Row 3 of A
vFloat A4 = vLoad( A + 3); //Row 4 of A

vFloat C1 = vZero(); //Row 1 of C, initialized to zero
vFloat C2 = vZero(); //Row 2 of C, initialized to zero
vFloat C3 = vZero(); //Row 3 of C, initialized to zero
vFloat C4 = vZero(); //Row 4 of C, initialized to zero

vFloat B1 = vLoad( B ); //Row 1 of B
vFloat B2 = vLoad( B + 1 ); //Row 2 of B
vFloat B3 = vLoad( B + 2 ); //Row 3 of B
vFloat B4 = vLoad( B + 3 ); //Row 4 of B

//Multiply the first row of B by the first column of A (do not sum across)
C1 = vMADD( vSplat( A1, 0 ), B1, C1 );
C2 = vMADD( vSplat( A2, 0 ), B1, C2 );
C3 = vMADD( vSplat( A3, 0 ), B1, C3 );
C4 = vMADD( vSplat( A4, 0 ), B1, C4 );

// Multiply the second row of B by the second column of A and
// add to the previous result (do not sum across)
C1 = vMADD( vSplat( A1, 1 ), B2, C1 );
C2 = vMADD( vSplat( A2, 1 ), B2, C2 );
C3 = vMADD( vSplat( A3, 1 ), B2, C3 );
C4 = vMADD( vSplat( A4, 1 ), B2, C4 );

// Multiply the third row of B by the third column of A and
// add to the previous result (do not sum across)
C1 = vMADD( vSplat( A1, 2 ), B3, C1 );
C2 = vMADD( vSplat( A2, 2 ), B3, C2 );
C3 = vMADD( vSplat( A3, 2 ), B3, C3 );
C4 = vMADD( vSplat( A4, 2 ), B3, C4 );

// Multiply the fourth row of B by the fourth column of A and
// add to the previous result (do not sum across)
C1 = vMADD( vSplat( A1, 3 ), B4, C1 );
C2 = vMADD( vSplat( A2, 3 ), B4, C2 );
C3 = vMADD( vSplat( A3, 3 ), B4, C3 );
C4 = vMADD( vSplat( A4, 3 ), B4, C4 );

// Write out the result to the destination

vStore( C1, C );

vStore( C2, C + 1 );
vStore( C3, C + 2 );
vStore( C4, C + 3 );
}
---

I compiled it with GNU GCC (v4.4) without problems, but when executing I got the errors below -

--

time ./matrix-4X4-sse2
Doing simple matrix multiply...
Doing vector matrix multiply...
Verifying results...failed at 0,0: 0.1041077441809730858013338661294256 1.615112779972704616472675241484077e-310
failed at 0,1: 0.2180626815004354512872453142335871 2.329319807572869703155260521236589e-311
failed at 0,2: 0.06656423826015861466842693516809959 5.021118224839903602770002086837975e-311
failed at 0,3: 0.02341829647346145570896425169848953 -1.35546831104864329338482674488192e-311
failed at 2,0: 1.443534677236949856111745706357672e-314 0.1041077441809730858013338661294256
failed at 2,1: 1.202089919575080043699319648301355e-314 0.2180626815004354512872453142335871
failed at 2,2: 1.74640041019361879362255661637412e-314 0.06656423826015861466842693516809959
failed at 2,3: -6.987567207824738989361744406306725e-315 0.02341829647346145570896425169848953

done.

--

I looked into it with the debugger and found the following -

While analyzing the row-matrix (MyMatrixMultiply()) API at L# 120, the "return (__m128)*(__v4DF)__P" didn't happen; the debugger jumped directly to "return __extension__ (__m128d){ 0.0, 0.0 }" in the emmintrin.h SSE2 intrinsics header file.

The previous SSE SP FP code posted earlier did show the above debugging flow within the xmmintrin.h intrinsics header file.

Did I miss anything while modifying the earlier SSE SP FP code into today's SSE2 DP FP code?

Any clue?

~BR

0 Kudos
srimks
New Contributor II
894 Views

I suspect the instructions below, as used in the above code, have alignment problems; the instructions used are -

#define vLoad( ptr ) _mm_load_pd( (double const *) (ptr)) // Loads two DP FP values (MOVAPD)
#define vStore( v, ptr ) _mm_store_pd( (double*) (ptr), v ) // Store two DP FP values (MOVAPD)

Could anyone suggest how to define a proper "declspec(align(16))" or any other alignment statement so that the above instructions don't have alignment problems? This code, when tested as SSE SP FP on Linux, executed successfully, but has problems in execution as SSE2 DP FP on Linux.

Sorry for being naive.

~BR

TimP
Honored Contributor III
Quoting - srimks

Could anyone suggest how to define proper "declspec(align(16))" or any alignment statement such that above instructions doesn't have alignment problems? This code when tested with SSE SP DP on Linux was succesfully executed but has problems in executionwith SSE2 DP FP on Linux

Do you mean you're having difficulty finding the manual pages on gcc __attribute__((aligned(16)))?

It's in info gcc, and also on the web:

http://gcc.gnu.org/onlinedocs/gcc-3.2.3/gcc/Variable-Attributes.html

icc should support both __attribute__ and declspec. Both of those will need to support 32 byte alignment, according to the main topic of this forum, but gcc currently supports only the 16-byte subset of AVX.
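
For instance, a minimal sketch of that attribute in source form; the array names and sizes are illustrative only.

---

#include <emmintrin.h>

/* 16-byte alignment on static and automatic arrays (gcc/icc attribute syntax) */
static double matrix[16] __attribute__(( aligned(16) ));

void aligned_store_demo( void )
{
    double row[4] __attribute__(( aligned(16) ));

    _mm_store_pd( &row[0], _mm_setzero_pd() );     /* aligned stores (MOVAPD) are now safe */
    _mm_store_pd( &row[2], _mm_setzero_pd() );
    _mm_store_pd( &matrix[0], _mm_setzero_pd() );
    _mm_store_pd( &matrix[2], _mm_setzero_pd() );
}

---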

We're already past the mid point of the useable lifetime of SSE2 intrinsic coding; it may already be considered "legacy" stuff when AVX hardware arrives. It would be interesting to see to what extent binary translation might be capable of turning SSE2 into AVX over the next 3 years.

srimks
New Contributor II
Quoting - tim18

Do you mean you're having difficulty finding the manual pages on gcc __attribute__((aligned(16)))?

It's in info gcc, and also on the web:

http://gcc.gnu.org/onlinedocs/gcc-3.2.3/gcc/Variable-Attributes.html

icc should support both __attribute__ and declspec. Both of those will need to support 32 byte alignment, according to the main topic of this forum, but gcc currently supports only the 16-byte subset of AVX.

We're already past the mid point of the useable lifetime of SSE2 intrinsic coding; it may already be considered "legacy" stuff when AVX hardware arrives. It would be interesting to see to what extent binary translation might be capable of turning SSE2 into AVX over the next 3 years.

No Tim.

Simply, I mean here that the above code, which has been designed for SSE2 DP FP, has alignment problems. It doesn't have any problem calling into GCC's emmintrin.h file; that part is perfectly fine, which I came to know while debugging.

I think if I can take care of alignment for the arguments (ptr, v) being passed to the _mm_load_pd() & _mm_store_pd() intrinsics, it should be fine. The query is how to define this alignment for the arguments being passed to these intrinsics.

Thanks for your input though.

~BR

srimks
New Contributor II
Quoting - srimks


I suspect the instructions below, as used in the above code, have alignment problems; the instructions used are -

#define vLoad( ptr ) _mm_load_pd( (double const *) (ptr)) // Loads two DP FP values (MOVAPD)
#define vStore( v, ptr ) _mm_store_pd( (double*) (ptr), v ) // Store two DP FP values (MOVAPD)

Could anyone suggest how to define a proper "declspec(align(16))" or any other alignment statement so that the above instructions don't have alignment problems? This code, when tested as SSE SP FP on Linux, executed successfully, but has problems in execution as SSE2 DP FP on Linux.

Sorry for being naive.

~BR

Hi All.

While using the Intel C++ Compiler (v10.0.23), compilation succeeded, but while executing I get a segmentation fault, as below -

------

Doing simple matrix multiply...
Doing vector matrix multiply...
Verifying results...failed at 0,0: -0.03592963096335939632286482492418145 -0.1215526035477607763590768286121602
failed at 0,1: 0.1015256338444925493513792957855912 0.1927740475185450996775671228533611
failed at 0,2: -0.102889309227864655937878524127882 0.006359662740173750716810019412150723
failed at 0,3: 0.09124841367405253644840001925331308 0.2332680556313918018851438773708651
failed at 2,0: 0.01115496609811927712641033139107094 -0.03592963096335939632286482492418145
failed at 2,1: -0.03152036281085966729076375258955522 0.1015256338444925493513792957855912
failed at 2,2: 0.03194373906779557348301068486762233 -0.102889309227864655937878524127882
failed at 2,3: -0.0283296247066727180374812178342836 0.09124841367405253644840001925331308
done.
Segmentation fault

---

Since I am getting a "SEGMENTATION FAULT" with the Intel compiler (v10.0.23), it's confirmed that the SSE2 DP FP code has alignment problems. GNU GCC (v4.4) was doing the same but didn't give a "SEGMENTATION FAULT", and I was able to check the calls into emmintrin.h.

Moreover, while using the Intel Debugger (IDB) to debug the SSE2 DP FP code, when I do "info registers", I get -

-------

(idb) info registers
$rax 0x7fbffff060 548682068064
$rdx 0x7fbffff120 548682068256
$rcx 0x7fbffff120 548682068256
$rbx 0x0 0
$rsi 0x7fbffff0a0 548682068128
$rdi 0x7fbffff060 548682068064
$rbp [$fp] 0x7fbffff050 (void *) 0x7fbffff050
$rsp [$sp] 0x7fbfffed70 (void *) 0x7fbfffed70
$r8 0x2a9557ae20 182894177824
$r9 0x0 0
$r10 0x22 34
$r11 0x246 582
$r12 0x400f50 4198224
$r13 0x7fbffff290 548682068624
$r14 0x0 0
$r15 0x0 0
$orig_rax 0xffffffffffffffff -1
$xmm0 0x0 {v4_float = {0, 0, 0, 0}, v2_double = {0, 0}, v16_int8 = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, v8_int16 = {0, 0, 0, 0, 0, 0, 0, 0}, v4_int32 = {0, 0, 0, 0}, v2_int64 = {0, 0}}
$xmm1 0x0 {v4_float = {0, 0, 0, 0}, v2_double = {0, 0}, v16_int8 = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, v8_int16 = {0, 0, 0, 0, 0, 0, 0, 0}, v4_int32 = {0, 0, 0, 0}, v2_int64 = {0, 0}}
$xmm2 0x0 {v4_float = {0, 0, 0, 0}, v2_double = {0, 0}, v16_int8 = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, v8_int16 = {0, 0, 0, 0, 0, 0, 0, 0}, v4_int32 = {0, 0, 0, 0}, v2_int64 = {0, 0}}
$xmm3 0x0 {v4_float = {0, 0, 0, 0}, v2_double = {0, 0}, v16_int8 = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, v8_int16 = {0, 0, 0, 0, 0, 0, 0, 0}, v4_int32 = {0, 0, 0, 0}, v2_int64 = {0, 0}}
$xmm4 0x0 {v4_float = {0, 0, 0, 0}, v2_double = {0, 0}, v16_int8 = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, v8_int16 = {0, 0, 0, 0, 0, 0, 0, 0}, v4_int32 = {0, 0, 0, 0}, v2_int64 = {0, 0}}
$xmm5 0x0 {v4_float = {0, 0, 0, 0}, v2_double = {0, 0}, v16_int8 = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, v8_int16 = {0, 0, 0, 0, 0, 0, 0, 0}, v4_int32 = {0, 0, 0, 0}, v2_int64 = {0, 0}}
$xmm6 0x0 {v4_float = {0, 0, 0, 0}, v2_double = {0, 0}, v16_int8 = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, v8_int16 = {0, 0, 0, 0, 0, 0, 0, 0}, v4_int32 = {0, 0, 0, 0}, v2_int64 = {0, 0}}
$xmm7 0x0 {v4_float = {0, 0, 0, 0}, v2_double = {0, 0}, v16_int8 = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, v8_int16 = {0, 0, 0, 0, 0, 0, 0, 0}, v4_int32 = {0, 0, 0, 0}, v2_int64 = {0, 0}}
$xmm8 0x0 {v4_float = {0, 0, 0, 0}, v2_double = {0, 0}, v16_int8 = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, v8_int16 = {0, 0, 0, 0, 0, 0, 0, 0}, v4_int32 = {0, 0, 0, 0}, v2_int64 = {0, 0}}
$xmm9 0x0 {v4_float = {0, 0, 0, 0}, v2_double = {0, 0}, v16_int8 = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, v8_int16 = {0, 0, 0, 0, 0, 0, 0, 0}, v4_int32 = {0, 0, 0, 0}, v2_int64 = {0, 0}}
$xmm10 0x0 {v4_float = {0, 0, 0, 0}, v2_double = {0, 0}, v16_int8 = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, v8_int16 = {0, 0, 0, 0, 0, 0, 0, 0}, v4_int32 = {0, 0, 0, 0}, v2_int64 = {0, 0}}
$xmm11 0x0 {v4_float = {0, 0, 0, 0}, v2_double = {0, 0}, v16_int8 = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, v8_int16 = {0, 0, 0, 0, 0, 0, 0, 0}, v4_int32 = {0, 0, 0, 0}, v2_int64 = {0, 0}}
$xmm12 0x0 {v4_float = {0, 0, 0, 0}, v2_double = {0, 0}, v16_int8 = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, v8_int16 = {0, 0, 0, 0, 0, 0, 0, 0}, v4_int32 = {0, 0, 0, 0}, v2_int64 = {0, 0}}
$xmm13 0x0 {v4_float = {0, 0, 0, 0}, v2_double = {0, 0}, v16_int8 = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, v8_int16 = {0, 0, 0, 0, 0, 0, 0, 0}, v4_int32 = {0, 0, 0, 0}, v2_int64 = {0, 0}}
$xmm14 0x0 {v4_float = {0, 0, 0, 0}, v2_double = {0, 0}, v16_int8 = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, v8_int16 = {0, 0, 0, 0, 0, 0, 0, 0}, v4_int32 = {0, 0, 0, 0}, v2_int64 = {0, 0}}
$xmm15 0x0 {v4_float = {0, 0, 0, 0}, v2_double = {0, 0}, v16_int8 = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, v8_int16 = {0, 0, 0, 0, 0, 0, 0, 0}, v4_int32 = {0, 0, 0, 0}, v2_int64 = {0, 0}}
$st0 0x0 (unprintable extended double precision float)
$st1 0x0 (unprintable extended double precision float)
$st2 0x0 (unprintable extended double precision float)
$st3 0x0 (unprintable extended double precision float)
$st4 0x0 (unprintable extended double precision float)
$st5 0x0 (unprintable extended double precision float)
$st6 0x3be3c9b52e0000000000 3.2655560360608236672811378540213e-317
$st7 0x3be3c9b52e0000000000 3.2655560360608236672811378540213e-317
$rip [$pc] 0x4008ae (void *) 0x4008ae
$rflags 0x202 514
-----

Similarly, when compiling and executing with GCC (v4.4) and using GNU GDB, when I do "info registers", I get the below -

---

(gdb) info registers
rax 0x7fbffff210 548682068496
rbx 0x7fbffff1b8 548682068408
rcx 0x7fbffff1a0 548682068384
rdx 0x7fbffff120 548682068256
rsi 0x7fbffff1a0 548682068384
rdi 0x7fbffff1e0 548682068448
rbp 0x7fbffff110 0x7fbffff110
rsp 0x7fbfffef30 0x7fbfffef30
r8 0x1 1
r9 0x0 0
r10 0x22 34
r11 0x246 582
r12 0x401410 4199440
r13 0x7fbffff340 548682068800
r14 0x0 0
r15 0x0 0
rip 0x400921 0x400921
eflags 0x302 770
cs 0x33 51
ss 0x2b 43
ds 0x0 0
es 0x0 0
fs 0x0 0
gs 0x0 0

---

Queries:

(a) How do I get rid of the alignment problems in the SSE2 DP FP code?

(b) Why are there differences in the contents of the rax & rcx registers between GNU GDB & Intel IDB?

(c) The rbx register content is empty under IDB, while GDB shows 548682068408; why is rbx empty as shown by IDB?

(d) The rdx register contents given by IDB & GDB are the same; why so?

Note: The above operations were done on two separate consoles, one with the environment set up for the GCC (v4.4) compiler and another with the environment set up for the Intel C++ Compiler (v10.0.23). The SSE2 DP FP C-based code is shown in Reply #5 above.

~BR

Michael_S_Intel8
Employee

For static array declarations, "__declspec(align(16))" should enforce 16-byte alignment. For dynamic allocation, you can use _mm_malloc() and _mm_free(). I believe these are in 'xmmintrin.h'. You could also do your own brute-force alignment by allocating 15 extra bytes and re-assigning the first 16-byte-aligned address to a new pointer.
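
For instance, a short sketch of the dynamic options; the buffer size and names are illustrative, and error checking is omitted.

---

#include <stdint.h>
#include <stdlib.h>
#include <emmintrin.h>   /* pulls in _mm_malloc / _mm_free via the intrinsics headers */

void aligned_allocation_demo( void )
{
    /* _mm_malloc / _mm_free: the second argument is the required alignment */
    double *buf = (double *) _mm_malloc( 16 * sizeof(double), 16 );
    _mm_store_pd( buf, _mm_setzero_pd() );     /* aligned store (MOVAPD) is safe */
    _mm_free( buf );

    /* Brute force: allocate 15 extra bytes and round up to the next 16-byte boundary */
    void   *raw     = malloc( 16 * sizeof(double) + 15 );
    double *aligned = (double *)( ((uintptr_t) raw + 15) & ~(uintptr_t) 15 );
    _mm_store_pd( aligned, _mm_setzero_pd() );
    free( raw );                               /* free the original pointer, not the aligned one */
}

---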

I noticed your two register dumps have different instruction pointer ($rip) values, so maybe you are breaking at two different points in the program. Anyway, you just need to isolate the instruction that caused the seg fault and see if it is an issue with 16-byte alignment. If it is, use one of the above alignment techniques and see if that resolves it.

Mike
