Beginner

## code vectorisation

How do I vectorise the first loop using SSE2 intrinsics to make the code faster?
#include "stdafx.h"
#include <iostream>
using namespace std;

int _tmain(int argc, _TCHAR* argv[])
{
short b[4][4]={1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4};
short a[4][4];
short c[4][4];
for (int j =0;j<4;j++)
{
a[0][j]=b[0][j]+b[3][j];
a[1][j]=b[1][j]+b[2][j];
a[2][j]=b[1][j]-b[2][j];
a[3][j]=b[0][j]-b[3][j];
// step 2
c[0][j]=a[0][j]+a[1][j];
c[1][j]=a[2][j]+(a[3][j]<<1);
c[2][j]=a[0][j]-a[1][j];
c[3][j]=a[3][j]-(a[2][j]<<1);
}
for (int i = 0; i < 4; i++)
{
for (int j = 0; j < 4; j++)
cout << a[i][j] << " ";
cout << endl;
}
cout << endl;
for (int i = 0; i < 4; i++)
{
for (int j = 0; j < 4; j++)
cout << c[i][j] << " ";
cout << endl;
}
return 0;
}

Employee
It looks like you have asked the same question twice, so I will answer here.
1. Is the loop count really 4 (j<4)? If so, there is little point in vectorizing it. To make it faster, just unroll it completely so that no loop remains; you may get a small benefit. To unroll, you can write a simple C macro and call it 4 times.
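
The unrolling suggested in point 1 could look roughly like the sketch below. The names `BUTTERFLY_COLUMN` and `transform_unrolled` are made up for illustration, and the sketch assumes that `j` indexes the columns of the 4x4 arrays from the question. It multiplies by 2 where the question shifts left by 1, which gives the same result but avoids shifting a negative value.

```cpp
// Hypothetical macro: one expansion performs both steps for column j.
#define BUTTERFLY_COLUMN(j)                          \
    do {                                             \
        a[0][j] = b[0][j] + b[3][j];                 \
        a[1][j] = b[1][j] + b[2][j];                 \
        a[2][j] = b[1][j] - b[2][j];                 \
        a[3][j] = b[0][j] - b[3][j];                 \
        /* step 2 */                                 \
        c[0][j] = a[0][j] + a[1][j];                 \
        c[1][j] = a[2][j] + 2 * a[3][j];             \
        c[2][j] = a[0][j] - a[1][j];                 \
        c[3][j] = a[3][j] - 2 * a[2][j];             \
    } while (0)

// Hypothetical helper: the loop over j is replaced by four macro calls.
void transform_unrolled(const short b[4][4], short a[4][4], short c[4][4])
{
    BUTTERFLY_COLUMN(0);
    BUTTERFLY_COLUMN(1);
    BUTTERFLY_COLUMN(2);
    BUTTERFLY_COLUMN(3);
}
```

Calling `transform_unrolled(b, a, c)` with the arrays from the question would replace the first loop. Many compilers unroll a fixed 4-iteration loop on their own at higher optimisation levels, so any gain from doing it by hand is likely to be small.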

2. If you actually have a big loop, then you may be able to vectorize it. I don't know whether you still need the values in a afterwards or whether they are just temporary variables. Assuming that you need them, your SSE code will look something like this (you may need to fix it up a little here and there):

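// xmm0..xmm3 are assumed to already hold eight 16-bit elements
// from rows b[0]..b[3] respectively
__m128i temp0 = xmm0;
__m128i temp1 = xmm1;
__m128i temp2 = xmm2;
__m128i temp3 = xmm3;

temp0 = _mm_add_epi16(temp0, temp3); // a[0] = b[0] + b[3]
temp1 = _mm_add_epi16(temp1, temp2); // a[1] = b[1] + b[2]
xmm0 = _mm_sub_epi16(xmm0, xmm3);    // a[3] = b[0] - b[3]
xmm1 = _mm_sub_epi16(xmm1, xmm2);    // a[2] = b[1] - b[2]

// _mm_store_si128 needs a 16-byte aligned __m128i* destination
__m128i temp4 = _mm_add_epi16(temp0, temp1); // c[0] = a[0] + a[1]
_mm_store_si128((__m128i*)c[0], temp4);
temp0 = _mm_sub_epi16(temp0, temp1);         // c[2] = a[0] - a[1]
_mm_store_si128((__m128i*)c[2], temp0);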

temp1 = xmm0; // a[3]
temp4 = xmm1; // a[2]

temp1 = _mm_slli_epi16(temp1, 1); // a[3] << 1
temp4 = _mm_slli_epi16(temp4, 1); // a[2] << 1
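
For reference, both steps of the calculation can also be written as one self-contained function. This is only a sketch under assumptions that are not in the original post: the function name `butterfly8_sse2` and its parameters are made up, each row is assumed to be at least 8 elements wide (so one 128-bit register covers 8 columns), and unaligned loads and stores are used so that no alignment requirement is placed on the arrays.

```cpp
#include <emmintrin.h> // SSE2 intrinsics

// Hypothetical helper: b0..b3 each point at eight 16-bit values taken from
// rows b[0]..b[3]; the eight results per row are written to c0..c3.
void butterfly8_sse2(const short* b0, const short* b1,
                     const short* b2, const short* b3,
                     short* c0, short* c1, short* c2, short* c3)
{
    // Load eight columns of each input row (no alignment assumed).
    const __m128i r0 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b0));
    const __m128i r1 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b1));
    const __m128i r2 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b2));
    const __m128i r3 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b3));

    // Step 1: the four intermediate rows a[0]..a[3].
    const __m128i a0 = _mm_add_epi16(r0, r3);
    const __m128i a1 = _mm_add_epi16(r1, r2);
    const __m128i a2 = _mm_sub_epi16(r1, r2);
    const __m128i a3 = _mm_sub_epi16(r0, r3);

    // Step 2: combine them; _mm_slli_epi16(x, 1) is the "<< 1" of the scalar code.
    const __m128i o0 = _mm_add_epi16(a0, a1);
    const __m128i o1 = _mm_add_epi16(a2, _mm_slli_epi16(a3, 1));
    const __m128i o2 = _mm_sub_epi16(a0, a1);
    const __m128i o3 = _mm_sub_epi16(a3, _mm_slli_epi16(a2, 1));

    // Store eight columns of each output row.
    _mm_storeu_si128(reinterpret_cast<__m128i*>(c0), o0);
    _mm_storeu_si128(reinterpret_cast<__m128i*>(c1), o1);
    _mm_storeu_si128(reinterpret_cast<__m128i*>(c2), o2);
    _mm_storeu_si128(reinterpret_cast<__m128i*>(c3), o3);
}
```

For a wide array this function would be called once per group of 8 columns, with a scalar loop handling any leftover columns. If the a values are needed afterwards, a0..a3 can be stored in the same way as the outputs.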