Beginner
113 Views

## From C++ to SSE code

Kindly assist with how I can change the code below from C++ to SSE instructions and obtain the same or similar results:

#include <iostream>
#include <cstdlib>
#include <ctime>
using namespace std;

int main() {
    int C[4][4] = {1,1,1,1,2,1,-1,-2,1,-1,-1,1,1,-2,2,-1}; // matrix
    int X[4][4] = {5,11,8,10,9,8,4,12,1,10,11,4,19,6,15,7};
    double S[4][4] = {0.25,0.1581,0.25,0.1581,0.1581,0.1,0.1581,0.1,0.25,0.1581,0.25,0.1581,0.1581,0.1,0.1581,0.1};
    int T[4][4];    // transpose of C
    int r[4][4];    // resultant of C * X
    int d[4][4];    // product of r and T
    double Y[4][4]; // coordinate-wise product of d and S
    int n = 4, m = 4;

    clock_t start, end; // start time and end time

    start = clock();

    // compute the transpose of matrix C
    for (int i = 0; i < n; i++)
        for (int j = 0; j < m; j++)
            T[i][j] = C[j][i];
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < m; j++)
            cout << T[i][j] << " ";
        cout << endl;
    }

    // compute the product of C and X
    cout << endl;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < m; j++)
            r[i][j] = 0;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < m; j++)
            for (int k = 0; k < m; k++)
                r[i][j] += C[i][k] * X[k][j]; // product formula
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < m; j++)
            cout << r[i][j] << " ";
        cout << endl;
    }

    // compute the product of C*X and the transpose of C
    for (int i = 0; i < n; i++)
        for (int j = 0; j < m; j++)
            d[i][j] = 0;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < m; j++)
            for (int k = 0; k < m; k++)
                d[i][j] += r[i][k] * T[k][j];

    // coordinate-wise product of d and S
    for (int j = 0; j < m; j++)
        for (int i = 0; i < n; i++)
            Y[i][j] = d[i][j] * S[i][j];

    // print d
    cout << endl;
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < m; j++)
            cout << d[i][j] << " ";
        cout << endl;
    }

    // print Y
    cout << endl;
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < m; j++)
            cout << Y[i][j] << " ";
        cout << endl;
    }

    end = clock();

    double ticks = (double)(end - start) / CLOCKS_PER_SEC;

    cout << "Elapsed time: " << ticks << " s" << endl;
    system("PAUSE");
    return EXIT_SUCCESS;
}

3 Replies
Black Belt
113 Views

Your question could be taken various ways. There is no contradiction between C++ and SSE; the current Intel compilers default to using SSE with auto-vectorization as much as possible. As your code uses int multiply, you may find there isn't adequate SSE support for those loops. If you set -O3 option for icc, in principle, the compiler has the freedom to swap loops where that may assist with SSE optimization.

With such short loops, the time required may be short compared with the resolution of clock(), so you won't see benefit from further optimization; certainly little pay-off for the effort of using SSE intrinsics where there is no reason to doubt the compiler's ability to optimize.
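To get a figure that clock() can actually resolve for work this small, one option is to repeat the computation many times and divide the total by the repetition count. The following is only a minimal sketch of that idea: the helper names `multiply4x4` and `seconds_per_call` are illustrative and do not come from the original post, and the repetition count is something you would tune for your machine.

```cpp
#include <cassert>  // used only by the check in the note below
#include <ctime>

// Hypothetical helper (not from the original post): r = a * b for 4x4 ints.
void multiply4x4(const int a[4][4], const int b[4][4], int r[4][4]) {
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j) {
            int sum = 0;
            for (int k = 0; k < 4; ++k)
                sum += a[i][k] * b[k][j];
            r[i][j] = sum;
        }
}

// Average seconds per call of multiply4x4, measured over `reps` repetitions.
double seconds_per_call(const int a[4][4], const int b[4][4], int reps) {
    int r[4][4];
    volatile int sink = 0;  // makes each result observable to the compiler
    std::clock_t start = std::clock();
    for (int n = 0; n < reps; ++n) {
        multiply4x4(a, b, r);
        sink = sink + r[n % 4][n % 4];
    }
    std::clock_t end = std::clock();
    return static_cast<double>(end - start) / CLOCKS_PER_SEC / reps;
}
```

Calling `seconds_per_call(C, X, 1000000)` with the matrices from the question gives an average per-call time; an optimizing compiler may still simplify the loop, so treat the number as a rough guide. As a sanity check on the helper, `multiply4x4(C, X, r)` should leave `r[0][0] == 34` for that data.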

Beginner
113 Views

Can someone demonstrate how SSE could be applied to multiplying three 2-D matrices, as in the case of C, X and T? T is the transpose of C.

X[4][4]={5,11,8,10,9,8,4,12,1,10,11,4,19,6,15,7}

C[4][4]={1,1,1,1,2,1,-1,-2,1,-1,-1,1,1,-2,2,-1}
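One possible shape for this, offered as a sketch rather than a definitive answer: the original SSE instructions work on four packed single-precision floats, and a packed 32-bit integer multiply (`_mm_mullo_epi32`) only arrives with SSE4.1, so the example below assumes the matrices are stored as `float`. For values as small as these the float results are exact. The function names `mul4x4_sse`, `transpose4x4_sse` and `cxct_sse` are illustrative, not from the thread.

```cpp
#include <cassert>      // used only by the check in the note below
#include <xmmintrin.h>  // SSE intrinsics

// r = a * b for row-major 4x4 float matrices (r must not alias a or b).
// Row i of the result is the sum over k of a[i][k] * (row k of b).
void mul4x4_sse(const float a[4][4], const float b[4][4], float r[4][4]) {
    // Load the four rows of b once; unaligned loads, no alignment assumed.
    __m128 b0 = _mm_loadu_ps(b[0]);
    __m128 b1 = _mm_loadu_ps(b[1]);
    __m128 b2 = _mm_loadu_ps(b[2]);
    __m128 b3 = _mm_loadu_ps(b[3]);
    for (int i = 0; i < 4; ++i) {
        // Broadcast each element of row i of a and scale the matching row of b.
        __m128 acc = _mm_mul_ps(_mm_set1_ps(a[i][0]), b0);
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_set1_ps(a[i][1]), b1));
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_set1_ps(a[i][2]), b2));
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_set1_ps(a[i][3]), b3));
        _mm_storeu_ps(r[i], acc);
    }
}

// t = transpose of c, using the _MM_TRANSPOSE4_PS macro from <xmmintrin.h>.
void transpose4x4_sse(const float c[4][4], float t[4][4]) {
    __m128 r0 = _mm_loadu_ps(c[0]);
    __m128 r1 = _mm_loadu_ps(c[1]);
    __m128 r2 = _mm_loadu_ps(c[2]);
    __m128 r3 = _mm_loadu_ps(c[3]);
    _MM_TRANSPOSE4_PS(r0, r1, r2, r3);
    _mm_storeu_ps(t[0], r0);
    _mm_storeu_ps(t[1], r1);
    _mm_storeu_ps(t[2], r2);
    _mm_storeu_ps(t[3], r3);
}

// d = c * x * transpose(c)
void cxct_sse(const float c[4][4], const float x[4][4], float d[4][4]) {
    float t[4][4], r[4][4];
    transpose4x4_sse(c, t);
    mul4x4_sse(c, x, r);
    mul4x4_sse(r, t, d);
}
```

With C and X converted to `float`, `cxct_sse(C, X, d)` fills `d` with C·X·T. Since the first row of C is all ones, `d[0][0]` is the sum of every entry of X, which is 140 for the data above. A coordinate-wise scaling such as the one by S in the first post would be one further `_mm_mul_ps` per row.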

Beginner
113 Views

I think you may be asking two different questions. However, the 2×2 case is indeed a good test case. Take 2×2 matrices a and b defined as follows:

```
a = |a1 a2|
    |a3 a4|

b = |b1 b2|
    |b3 b4|
```

Multiply the two out and you get the following:

```
a*b = | a1*b1+a2*b3    a1*b2+a2*b4 |
      | a3*b1+a4*b3    a3*b2+a4*b4 |
```
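A plain scalar version of the 2×2 product is handy as a reference to check any SSE variant against. This is just a sketch; the `Mat2` type and `mul2x2` function are illustrative names, with the four floats stored row by row.

```cpp
#include <cassert>  // used only by the check in the note below

// Row-major 2x2 matrix: {m1, m2, m3, m4} stands for |m1 m2|
//                                                   |m3 m4|
struct Mat2 {
    float m1, m2, m3, m4;
};

// Scalar reference product: each entry is (row of a) dot (column of b).
Mat2 mul2x2(const Mat2& a, const Mat2& b) {
    Mat2 r;
    r.m1 = a.m1 * b.m1 + a.m2 * b.m3;
    r.m2 = a.m1 * b.m2 + a.m2 * b.m4;
    r.m3 = a.m3 * b.m1 + a.m4 * b.m3;
    r.m4 = a.m3 * b.m2 + a.m4 * b.m4;
    return r;
}
```

For example, with a = {1, 2, 3, 4} and b = {5, 6, 7, 8}, `mul2x2(a, b)` yields {19, 22, 43, 50}, so `assert(mul2x2(a, b).m1 == 19.0f)` holds.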
Now, what SSE gives you in this case is the ability to multiply four pairs of single-precision floats with a single instruction (MULPS). Multiplying the two 2×2 matrices takes 8 multiplications in total, so two MULPS instructions cover them, and one packed add (ADDPS) then combines the partial products. The steps are to load the matrices into registers, shuffle the elements into the right order, do the two multiplications, and add the results. This code will do that for you:
```
struct vect4 {
    float a1, a2, a3, a4;
};

// MSVC-style inline assembly for 32-bit x86.
vect4 trySSE2D(vect4 &a, vect4 &b)
{
    // 2x2 matrix multiplication: load a and b into registers,
    // shuffle, multiply, then add the partial products.

    vect4 returnMe;

    __asm {
        MOV EAX, a
        MOV EBX, b

        MOVUPS XMM0, [EAX]
        MOVUPS XMM1, [EBX]
        // create copies for second operation
        MOVAPS XMM2, XMM0
        MOVAPS XMM3, XMM1

        // We are going to use the shuffle to get
        // some useful vectors.

        SHUFPS XMM0, XMM0, 0xA0 // 10 10 00 00 -> a1 a1 a3 a3
        SHUFPS XMM1, XMM1, 0x44 // 01 00 01 00 -> b1 b2 b1 b2
        SHUFPS XMM2, XMM2, 0xF5 // 11 11 01 01 -> a2 a2 a4 a4
        SHUFPS XMM3, XMM3, 0xEE // 11 10 11 10 -> b3 b4 b3 b4

        // Perform the multiplications
        MULPS XMM0, XMM1
        MULPS XMM2, XMM3

        // Add the partial products to form the four result entries
        ADDPS XMM0, XMM2

        // copy return from register to memory
        MOVUPS [returnMe], XMM0
    }
    return returnMe;
}
```

Note that for the Intel compiler at least, there are intrinsics you can use to do the same thing (using this code):

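```
#include <xmmintrin.h> // SSE intrinsics

void trySSE2DIntrin(float a[4], float b[4], float returnME[4])
{

// Load both matrices (unaligned loads) and keep a copy of each
__m128 areg = _mm_loadu_ps(a);
__m128 breg = _mm_loadu_ps(b);
__m128 acpy = areg;
__m128 bcpy = breg;

areg = _mm_shuffle_ps(areg, areg, 0xA0); // 10 10 00 00
breg = _mm_shuffle_ps(breg, breg, 0x44); // 01 00 01 00
acpy = _mm_shuffle_ps(acpy, acpy, 0xF5); // 11 11 01 01
bcpy = _mm_shuffle_ps(bcpy, bcpy, 0xEE); // 11 10 11 10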

areg = _mm_mul_ps(areg, breg);
acpy = _mm_mul_ps(acpy, bcpy);