from c++ to SSE-CODES

Smart_Lubobya · ‎02-05-2010

Kindly assist on how I can change the code below from C++ to SSE instructions and obtain the same or similar results:

#include <iostream>
#include <ctime>
#include <cstdlib>
using namespace std;

int main() {
int C[4][4]={1,1,1,1,2,1,-1,-2,1,-1,-1,1,1,-2,2,-1}; // matrix
int X[4][4] = {5,11,8,10,9,8,4,12,1,10,11,4,19,6,15,7};//
double S[4][4] = {0.25,0.1581,0.25,0.1581,0.1581,0.1,0.1581,0.1,0.25,0.1581,0.25,0.1581,0.1581,0.1,0.1581,0.1};
int T[4][4]; // transpose of C
int r[4][4];// resultant of C * X
int d[4][4];// product of r and T
double Y[4][4]; // coordinate-wise product of d and S
int n=4, m=4;

clock_t start,end; //start time and end time

start = clock();// * CLOCKS_PER_SEC;

//compute the transpose of matrix C
for (int i=0;i<n;i++)
  for (int j=0;j<m;j++)
    T[i][j]=C[j][i];
for (int i=0;i<n;i++){
  for (int j=0;j<m;j++)
    cout << T[i][j] << " ";
cout << endl;
}

//compute the product of C and X
cout << endl;
for (int i=0;i<n;i++)
  for (int j=0;j<m;j++)
    r[i][j]=0;
for (int i=0;i<n;i++)
  for (int j=0;j<m;j++)
    for (int k=0;k<m;k++)
      r[i][j] += C[i][k]*X[k][j];// Product formula
for (int i=0;i<n;i++){
  for (int j=0;j<m;j++)
    cout << r[i][j] << " ";
cout << endl;
}
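//compute the product of C*X and the transpose of C
for (int i=0;i<n;i++)
  for (int j=0;j<m;j++)
    d[i][j]=0;
for (int i=0;i<n;i++)
  for (int j=0;j<m;j++)
    for (int k=0;k<m;k++)
      d[i][j] += r[i][k]*T[k][j];

//compute the coordinate-wise product of d and S
for (int j=0; j<m; j++){
  for (int i=0; i<n; i++)
    Y[i][j] = d[i][j] * S[i][j];
}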
cout << endl;
for (int i=0;i<n;i++){
  for (int j=0;j<m;j++)
    cout << d[i][j] << " ";
cout << endl;
}

cout << endl;
for (int i=0;i<n;i++){
  for (int j=0;j<m;j++)
    cout << Y[i][j] << " ";
cout << endl;
}

end = clock();

double ticks = (double) (end-start)/CLOCKS_PER_SEC;

cout << "Elapsed time: " << ticks << " s" << endl;
system ("PAUSE");
return EXIT_SUCCESS;
}

TimP · ‎02-05-2010

Your question could be taken various ways. There is no contradiction between C++ and SSE; the current Intel compilers default to using SSE with auto-vectorization as much as possible. As your code uses int multiply, you may find there isn't adequate SSE support for those loops. If you set -O3 option for icc, in principle, the compiler has the freedom to swap loops where that may assist with SSE optimization.

With such short loops, the time required may be short compared with the resolution of clock(), so you won't see benefit from further optimization; certainly little pay-off for the effort of using SSE intrinsics where there is no reason to doubt the compiler's ability to optimize.
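To get around the resolution of clock(), one common approach is to repeat the work many times and divide by the repeat count. Here is a minimal sketch of that idea; the names `average_seconds`, `multiply_once` and `sink` are made up for illustration and are not from the original program, and only the C*X product is timed:

```cpp
#include <cassert>
#include <ctime>

// Hypothetical helper (not from the thread): call `work` `iterations` times
// and return the average CPU time per call in seconds, measured with clock().
double average_seconds(void (*work)(), long iterations) {
    assert(iterations > 0);
    std::clock_t start = std::clock();
    for (long i = 0; i < iterations; ++i) {
        work();
    }
    std::clock_t end = std::clock();
    return static_cast<double>(end - start) / CLOCKS_PER_SEC / iterations;
}

// The result is written to a volatile so the store cannot be optimized away.
volatile int sink;

// Example workload: the 4x4 integer product C*X from the question.
// X is volatile so the compiler has to read it on every call instead of
// folding the whole product into a constant.
void multiply_once() {
    static const int C[4][4] = {{1, 1, 1, 1}, {2, 1, -1, -2},
                                {1, -1, -1, 1}, {1, -2, 2, -1}};
    static volatile int X[4][4] = {{5, 11, 8, 10}, {9, 8, 4, 12},
                                   {1, 10, 11, 4}, {19, 6, 15, 7}};
    int sum = 0;
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            for (int k = 0; k < 4; ++k)
                sum += C[i][k] * X[k][j];
    sink = sum;  // sum of all elements of C*X
}
```

Calling `average_seconds(multiply_once, 1000000)` should give a per-call figure that is well below what a single clock() interval can resolve.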

Smart_Lubobya · ‎02-07-2010

Can someone demonstrate how SSE could be applied to multiplying three (3) 2-D matrices, as in the case of C, X and T? T is the transpose of C.

X[4][4]={5,11,8,10,9,8,4,12,1,10,11,4,19,6,15,7}

C[4][4]={1,1,1,1,2,1,-1,-2,1,-1,-1,1,1,-2,2,-1}

AdamB1 · ‎02-07-2010

I think you may be asking two different questions. However, the 2D case is indeed a good test case. Take 2D matrices a and b defined as follows:

a = | a1 a2 |
    | a3 a4 |

b = | b1 b2 |
    | b3 b4 |

Multiply the two out and you get the following:

a*b = | a1*b1+a2*b3    a1*b2+a2*b4 |
      | a3*b1+a4*b3    a3*b2+a4*b4 |

Now, what SSE gives you in this case is the ability to multiply four pairs of floats with a single instruction. If you count the total multiplications needed to multiply the two 2D matrices, you only need 8, so you should be able to do that in two steps. The steps are to copy the matrices into registers, get the elements in the right order, and do the two multiplications. This code will do that for you:

[bash]struct vect4{
	float a1, a2, a3, a4;
};

vect4 trySSE2D(vect4 &a, vect4 &b)
{
	// for matrix multiplication - first, move vector A 
	// and vector B into the register

	vect4 returnMe;

	__asm{
		MOV EAX, a
		MOV EBX, b
		
		MOVUPS XMM0, [EAX]
		MOVUPS XMM1, [EBX]
		// create copies for second operation
		MOVAPS XMM2, XMM0
		MOVAPS XMM3, XMM1

	    // We are going to use the shuffle to get
		// some useful vectors.

		SHUFPS XMM0, XMM0, 0xA0 // 10 10 00 00
		SHUFPS XMM1, XMM1, 0x44 // 01 00 01 00
		SHUFPS XMM2, XMM2, 0xF5 // 11 11 01 01
		SHUFPS XMM3, XMM3, 0xEE // 11 10 11 10

		// Perform the Multiplications
		MULPS XMM0, XMM1
		MULPS XMM2, XMM3

		// Perform the Addition
		ADDPS XMM0, XMM2

		// copy return from register to memory
		MOVUPS [returnMe], XMM0

	}
	return returnMe;

}[/bash]

Note that for the Intel compiler at least, there are intrinsics you can use to do the same thing (using this code):

[bash]#include <xmmintrin.h> // SSE intrinsics (__m128, _mm_*_ps)

void trySSE2DIntrin(float a[4], float b[4], float returnME[4])
{


	__m128 areg, breg, acpy, bcpy;
	areg = _mm_loadu_ps(a);
	breg = _mm_loadu_ps(b);
	acpy = _mm_loadu_ps(a);
	bcpy = _mm_loadu_ps(b);


	areg = _mm_shuffle_ps(areg, areg, 0xA0); // 10 10 00 00
	breg = _mm_shuffle_ps(breg, breg, 0x44); // 01 00 01 00
	acpy = _mm_shuffle_ps(acpy, acpy, 0xF5); // 11 11 01 01
	bcpy = _mm_shuffle_ps(bcpy, bcpy, 0xEE); // 11 10 11 10


	areg = _mm_mul_ps(areg, breg);
	acpy = _mm_mul_ps(acpy, bcpy);

	areg = _mm_add_ps(areg, acpy);

	_mm_storeu_ps(returnME, areg);

}[/bash]
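If the hex shuffle constants are hard to read, the `_MM_SHUFFLE` macro from `xmmintrin.h` spells out the lane indices instead. A small sketch of the same 2x2 product written that way, next to a plain scalar version to check it against; the function names `mul2x2_scalar` and `mul2x2_sse` are just illustrative:

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE intrinsics and the _MM_SHUFFLE macro

// Scalar reference: out = a * b for row-major 2x2 matrices.
// out must not alias a or b, because inputs are re-read after writes.
void mul2x2_scalar(const float a[4], const float b[4], float out[4]) {
    assert(out != a && out != b);
    out[0] = a[0] * b[0] + a[1] * b[2];
    out[1] = a[0] * b[1] + a[1] * b[3];
    out[2] = a[2] * b[0] + a[3] * b[2];
    out[3] = a[2] * b[1] + a[3] * b[3];
}

// Same shuffles as the intrinsic version above, written with _MM_SHUFFLE.
// _MM_SHUFFLE(z, y, x, w) picks source lane w for result lane 0, x for
// lane 1, y for lane 2 and z for lane 3.
void mul2x2_sse(const float a[4], const float b[4], float out[4]) {
    __m128 av = _mm_loadu_ps(a);  // a1 a2 a3 a4
    __m128 bv = _mm_loadu_ps(b);  // b1 b2 b3 b4

    __m128 a_lo = _mm_shuffle_ps(av, av, _MM_SHUFFLE(2, 2, 0, 0));  // a1 a1 a3 a3 (0xA0)
    __m128 b_lo = _mm_shuffle_ps(bv, bv, _MM_SHUFFLE(1, 0, 1, 0));  // b1 b2 b1 b2 (0x44)
    __m128 a_hi = _mm_shuffle_ps(av, av, _MM_SHUFFLE(3, 3, 1, 1));  // a2 a2 a4 a4 (0xF5)
    __m128 b_hi = _mm_shuffle_ps(bv, bv, _MM_SHUFFLE(3, 2, 3, 2));  // b3 b4 b3 b4 (0xEE)

    // Two multiplies and one add give all four result elements.
    __m128 result = _mm_add_ps(_mm_mul_ps(a_lo, b_lo), _mm_mul_ps(a_hi, b_hi));
    _mm_storeu_ps(out, result);
}
```

For a = {1, 2, 3, 4} and b = {5, 6, 7, 8} both functions give {19, 22, 43, 50}.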

That is the 2x2 multiplication, the same operation used for a 2D rotation. If you want the transpose of a matrix, it is a simple matter of calling SHUFPS with the correct arguments. I have included some code that gives you the 2D case three ways (multiplying in a normal, very inefficient fashion, the direct assembly code, and the intrinsics), as well as the transpose in direct assembly.
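For a 4x4 float matrix you do not have to work out the shuffles by hand: `xmmintrin.h` provides the `_MM_TRANSPOSE4_PS` macro, which transposes four `__m128` rows in place using a short shuffle/unpack sequence (the exact instructions depend on the compiler). A minimal sketch, assuming a row-major `float[16]` layout; the wrapper name `transpose4x4` is made up here:

```cpp
#include <cassert>
#include <xmmintrin.h>  // _MM_TRANSPOSE4_PS

// Transpose a row-major 4x4 float matrix in place.
void transpose4x4(float m[16]) {
    assert(m != nullptr);
    __m128 r0 = _mm_loadu_ps(m + 0);   // row 0
    __m128 r1 = _mm_loadu_ps(m + 4);   // row 1
    __m128 r2 = _mm_loadu_ps(m + 8);   // row 2
    __m128 r3 = _mm_loadu_ps(m + 12);  // row 3

    // After the macro, r0..r3 hold the columns of the original matrix.
    _MM_TRANSPOSE4_PS(r0, r1, r2, r3);

    _mm_storeu_ps(m + 0, r0);
    _mm_storeu_ps(m + 4, r1);
    _mm_storeu_ps(m + 8, r2);
    _mm_storeu_ps(m + 12, r3);
}
```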

That brings me back to what I think you are really looking for. You want to do a 4x4 matrix (or 4D). That type of operation is far more common in physics (3 spatial and 1 time dimension) and computer graphics (homogeneous coordinates). It is also the size of the arrays you keep linking (although I would do them as 4x4 floats because those will fit in the registers more naturally).

If you take a minute to think about it, you get each element by multiplying a row of 4 numbers by a column of 4 numbers and adding up the products. So the general procedure should be fairly clear. Copy the row of numbers into a 128-bit register (I've actually shown you one way to do that above if you look at the intrinsic function method for 2D). Copy the column into another. With both in registers, perform the multiplication and addition. Copy the result out.
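The row-times-column procedure just described could look roughly like this with intrinsics. This is only a sketch under some assumptions: the matrices are row-major `float[16]` arrays, and the names `horizontal_sum` and `mul4x4_rowcol` are invented for the example. The column is gathered with `_mm_set_ps` because its elements are not contiguous in memory:

```cpp
#include <cassert>
#include <xmmintrin.h>

// Add the four lanes of v together and return the scalar sum.
static float horizontal_sum(__m128 v) {
    __m128 shuf = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 3, 0, 1));  // v1 v0 v3 v2
    __m128 sums = _mm_add_ps(v, shuf);   // lane 0 = v0+v1, lane 2 = v2+v3
    shuf = _mm_movehl_ps(shuf, sums);    // lane 0 = v2+v3
    sums = _mm_add_ss(sums, shuf);       // lane 0 = v0+v1+v2+v3
    return _mm_cvtss_f32(sums);
}

// out = a * b for row-major 4x4 float matrices.
// out must not alias a or b.
void mul4x4_rowcol(const float a[16], const float b[16], float out[16]) {
    assert(out != a && out != b);
    for (int i = 0; i < 4; ++i) {
        __m128 row = _mm_loadu_ps(a + 4 * i);  // row i of a
        for (int j = 0; j < 4; ++j) {
            // Column j of b: _mm_set_ps takes the highest lane first.
            __m128 col = _mm_set_ps(b[12 + j], b[8 + j], b[4 + j], b[j]);
            // One multiply gives the four products; then sum them.
            out[4 * i + j] = horizontal_sum(_mm_mul_ps(row, col));
        }
    }
}
```

With C and X from the question converted to float, the first row of the result is 34 35 38 33. The per-element column gather and horizontal sum are the expensive parts, so this version is mainly useful for seeing the idea rather than for speed.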

The process is fairly straightforward. I apologize in advance for not writing a full 4D version, but it is rather time consuming. The code I provided for 2D does enough that you should be able to generalize it to 4D. I just want to come back and echo something that tim18 said though. This type of problem is one that compilers have a very easy time vectorizing. I had to search for a way to write it that the compiler didn't recognize and vectorize automatically to get my inefficient version. Even then, it only calls a few more instructions per cycle than the coded version. I am not sure you need to use intrinsics or straight assembly code to get the result you desire.
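For anyone who still wants to try it by hand, here is one possible sketch of the whole computation from the first post, Y = (C * X * C^T) scaled element by element with S. It makes several assumptions that are not in the original code: everything is stored as row-major `float[16]` rather than int and double, the function names `mul4x4` and `transform4x4` are invented, and the product uses a broadcast-and-accumulate scheme (each element of a row of the left matrix is multiplied by a whole row of the right matrix) instead of the row-times-column method, because that avoids gathering columns:

```cpp
#include <cassert>
#include <xmmintrin.h>

// out = a * b for row-major 4x4 floats; out must not alias a or b.
// Row i of the product is a[i][0]*row0(b) + a[i][1]*row1(b) + ...
static void mul4x4(const float a[16], const float b[16], float out[16]) {
    assert(out != a && out != b);
    const __m128 b0 = _mm_loadu_ps(b + 0);
    const __m128 b1 = _mm_loadu_ps(b + 4);
    const __m128 b2 = _mm_loadu_ps(b + 8);
    const __m128 b3 = _mm_loadu_ps(b + 12);
    for (int i = 0; i < 4; ++i) {
        __m128 acc = _mm_mul_ps(_mm_set1_ps(a[4 * i + 0]), b0);
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_set1_ps(a[4 * i + 1]), b1));
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_set1_ps(a[4 * i + 2]), b2));
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_set1_ps(a[4 * i + 3]), b3));
        _mm_storeu_ps(out + 4 * i, acc);
    }
}

// Y = (C * X * C^T) multiplied element-wise by S.
void transform4x4(const float C[16], const float X[16],
                  const float S[16], float Y[16]) {
    float Ct[16], r[16], d[16];

    // Ct = transpose of C.
    __m128 c0 = _mm_loadu_ps(C + 0);
    __m128 c1 = _mm_loadu_ps(C + 4);
    __m128 c2 = _mm_loadu_ps(C + 8);
    __m128 c3 = _mm_loadu_ps(C + 12);
    _MM_TRANSPOSE4_PS(c0, c1, c2, c3);
    _mm_storeu_ps(Ct + 0, c0);
    _mm_storeu_ps(Ct + 4, c1);
    _mm_storeu_ps(Ct + 8, c2);
    _mm_storeu_ps(Ct + 12, c3);

    mul4x4(C, X, r);   // r = C * X
    mul4x4(r, Ct, d);  // d = r * C^T

    // Coordinate-wise scaling, one row (four floats) per multiply.
    for (int i = 0; i < 4; ++i) {
        __m128 scaled = _mm_mul_ps(_mm_loadu_ps(d + 4 * i),
                                   _mm_loadu_ps(S + 4 * i));
        _mm_storeu_ps(Y + 4 * i, scaled);
    }
}
```

With the C, X and S values from the question, Y[0][0] should come out as 35 (140 * 0.25). Whether this beats what the compiler generates from the plain loops is something you would have to measure.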