Aligning 'btr' or its address using __declspec(align(16))

Smart_Lubobya · ‎08-06-2010

How can I align 'btr' or its address before doing the load and add? 'btr' can be an array of int, float or short. See the code snippet below:
//.h file

class tom

{

public:

virtual void add(void* btr);

.

};

// .cpp file

void tom::add(void* btr)

{

int* b = (int *)btr;

__m128i f0,f1,f2,f3,s0,s1;

f0 = _mm_load_si128(b);

f1 = _mm_load_si128(b+1);

f2 = _mm_load_si128(b+2);

f3 = _mm_load_si128(b+3);

s0 =_mm_add_epi16(f0,f3);

s1 =_mm_add_epi16(f1,f2);

}

Brijender_B_Intel · ‎08-06-2010

If you are not sure whether btr is aligned, you can put a "C" loop or scalar SSE loop before the vectorized loop to process the initial non-aligned elements. At maximum

Something like this:

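while ((long long)btr & 15) {   // test the pointer value, not the address of the pointer variable

scalar code or "C" code that processes one element and advances btr

}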

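Or you can create an aligned pointer as below:
int *btrA = (int*)malloc(size + 16);   // over-allocate; keep btrA for free()

btr = (int*)(((long long)btrA + 15) & ~15LL);   // round up to the next 16-byte boundary

or
use the __declspec(align(16)) directive.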
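
The first suggestion (a scalar loop in front of the vectorized loop) can be sketched as follows. This is only an illustration under assumptions: the helper names `is_aligned16` and `sum_ints` are hypothetical, and summing ints stands in for whatever the real kernel does.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical helper: true if p sits on a 16-byte boundary.
inline bool is_aligned16(const void* p) {
    return (reinterpret_cast<std::uintptr_t>(p) & 15u) == 0;
}

// Hypothetical example: sum n ints, peeling scalar iterations until the
// pointer is 16-byte aligned, then using aligned 128-bit loads.
inline int sum_ints(const int* p, std::size_t n) {
    int total = 0;
    std::size_t i = 0;

    // Scalar prologue: at most 3 iterations for naturally aligned int data.
    while (i < n && !is_aligned16(p + i)) {
        total += p[i++];
    }

    // Vector body: p + i is 16-byte aligned here whenever the loop runs.
    __m128i acc = _mm_setzero_si128();
    for (; i + 4 <= n; i += 4) {
        assert(is_aligned16(p + i));
        acc = _mm_add_epi32(acc, _mm_load_si128(reinterpret_cast<const __m128i*>(p + i)));
    }
    alignas(16) int lanes[4];
    _mm_store_si128(reinterpret_cast<__m128i*>(lanes), acc);
    total += lanes[0] + lanes[1] + lanes[2] + lanes[3];

    // Scalar epilogue for the leftover elements.
    for (; i < n; ++i) total += p[i];
    return total;
}
```

Calling `sum_ints(data + 1, 18)` on a 16-byte-aligned `data` array runs three scalar iterations first and only then switches to `_mm_load_si128`.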

jeyonpolardhotmail_c · ‎08-07-2010

I think it's a scripting language. Am I right? Tell me.

Smart_Lubobya · ‎08-09-2010

I tried aligning like this: __declspec(align(16)) int* b = (int *)btr; but when debugging, the load failed. Are my b, b+1, b+2, b+3 arrays well aligned? Just where is the bug? The code compiles well but fails to read values f0 to f3. Note that 'btr' can be short or int.

class tom

{

public:

virtual void add(void* btr);

.

};

// .cpp file

void tom::add(void* btr)

{

__declspec(align(16))int* b = (int *)btr;

__m128i f0,f1,f2,f3,s0,s1;

f0 = _mm_load_si128(b);

f1 = _mm_load_si128(b+1);

f2 = _mm_load_si128(b+2);

f3 = _mm_load_si128(b+3);

s0 =_mm_add_epi16(f0,f3);

s1 =_mm_add_epi16(f1,f2);

}
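
A note on the declaration above: an alignment specifier placed on a pointer variable aligns the storage of the pointer itself, not the memory it points to, so the address held in b is exactly the address that was passed in. The sketch below illustrates this with the standard C++11 `alignas` keyword used as a portable stand-in for `__declspec(align(16))`; the names `Block4x4`, `is_aligned16`, `require_aligned16` and `pointee_aligned_after_copy` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: true if the address p is a multiple of 16.
inline bool is_aligned16(const void* p) {
    return (reinterpret_cast<std::uintptr_t>(p) & 15u) == 0;
}

// Hypothetical buffer type: the alignment is requested where the data is defined.
struct Block4x4 {
    alignas(16) short data[16];  // 32 bytes, starts on a 16-byte boundary
};

// Hypothetical guard to call before any aligned SSE load.
inline const short* require_aligned16(const void* p) {
    assert(is_aligned16(p));
    return static_cast<const short*>(p);
}

// Aligning a pointer variable only affects where the pointer itself is stored.
// The value it holds (the address of the data) is unchanged.
inline bool pointee_aligned_after_copy(void* btr) {
    alignas(16) short* b = static_cast<short*>(btr);  // aligns b, not *b
    return is_aligned16(b);  // same answer as is_aligned16(btr)
}
```

With this layout the caller owns the alignment: a `Block4x4` object always passes the check, while a pointer into the middle of it does not, whatever specifier is put on the local pointer.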

Brijender_B_Intel · ‎08-09-2010

Can you please tell me how "btr" is defined?
Secondly, are you trying to load like this (assume the elements are x0, x1, x2, x3, ...)?

f0 = x0 x1 x2 x3
f1 = x1 x2 x3 x4
f2 = x2 x3 x4 x5

This is what the current code does. If that is what you want, you don't need to load all 4 vectors (f0, f1, f2, f3). You can load f0 and f3, then shuffle their elements to get f1 and f2. It will save you two unaligned loads.

Or do you want to do this:

f0 = x0 x1 x2 x3
f1 = x4 x5 x6 x7
...
If this is what you want, then you need to use (b+4) for the second load, (b+8) for the third load, and so on.
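
The difference between the two layouts comes down to pointer arithmetic: adding 1 to an int* moves 4 bytes, while adding 1 to an __m128i* moves 16 bytes. A small sketch, assuming an int array where x[i] == i so that the first loaded lane reveals where the load started (the helper names are hypothetical):

```cpp
#include <cassert>
#include <emmintrin.h>  // SSE2 intrinsics

// Returns the first 32-bit element of the 128-bit vector loaded from p.
// Uses an unaligned load so that any int offset is legal.
inline int first_lane(const int* p) {
    __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(p));
    return _mm_cvtsi128_si32(v);
}

// int* arithmetic: stepping by k moves k elements, i.e. 4*k bytes.
inline int start_index_int_step(const int* x, int k) {
    assert(k >= 0);
    return first_lane(x + k);
}

// __m128i* arithmetic: stepping by k moves k vectors, i.e. 16*k bytes.
inline int start_index_vec_step(const int* x, int k) {
    assert(k >= 0);
    const __m128i* v = reinterpret_cast<const __m128i*>(x);
    return first_lane(reinterpret_cast<const int*>(v + k));
}
```

So `x + 1` starts the load at x1 (overlapping windows), whereas `x + 4` or `(__m128i*)x + 1` starts it at x4 (consecutive rows).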

Smart_Lubobya · ‎08-10-2010

Thanks for the reply. btr is a 4x4 2-D array of short. I want to load it as four arrays of int, like

f0 = x0 x1 x2 x3
f1 = x4 x5 x6 x7. I have tried your earlier advice, but only f0 and f1 are loaded. Further advice is 100% needed.

Nasile.h

class Nasile

{

public:

virtual void add(void* ptr) = 0;

};

tom.h

#include "Nasile.h"

class tom : public Nasile

{

public:

virtual void add(void* ptr);

};

Tom.cpp

void tom::add(void* ptr)

{

__declspec(align(16)) short* b = (short*)ptr;

__m128i s0,s1,f0,f1,f2,f3;

f0 = _mm_load_si128((__m128i*)b);

f1 = _mm_load_si128((__m128i*)b+4);

f2 = _mm_load_si128((__m128i*)b+8);

f3 = _mm_load_si128((__m128i*)b+12);

s0 =_mm_add_epi32(f0,f3);

s1 =_mm_add_epi32(f1,f2);

}

Charles.h

#include "Nasile.h"

{

public:

virtual void Quant (Nasile * pQ) { pQ->add(_pBlk); }

};

Charles.cpp

#include "Charles.h"

_pBlk = new short[16];

Brijender_B_Intel · ‎08-10-2010

So you have 16 elements of type short. A short is 2 bytes, so in total you have 32 bytes to load (16 x 2). One XMM register can hold 16 bytes, so all of them are loaded by two load instructions. Your third load will fault because you are going out of range.

So your first register is f0 = x0 x1 x2 x3 x4 x5 x6 x7 (8 packed elements),
and similarly the second one.

If you want to convert short (2 bytes) to int (4 bytes) before processing, you need to use the unpack instructions punpcklwd and punpckhwd (please check the IA-32/64 programming reference manual, volume 2B).
It will be something like this:

__m128i temp = _mm_set_epi32(0, 0, 0, 0);
load f0 and f1.

f2 = f0;
f0 = _mm_unpacklo_epi16(f0, temp); // puts x0 x1 x2 x3 in f0, each element padded with 16 zero bits
f2 = _mm_unpackhi_epi16(f2, temp); // puts the high elements x4 x5 x6 x7 in f2
Do the same thing with f1 and f3.
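
One caveat worth illustrating: unpacking against an all-zero register zero-extends each short, which leaves a negative value such as -114 as 65422 when read as a 32-bit int. If the data is signed, one common alternative is to unpack against a sign mask instead. The sketch below shows both variants; `widen_epi16` is a hypothetical helper name, not something from the posts above.

```cpp
#include <cassert>
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical helper: widen 8 packed shorts in v into 8 ints written to out.
// zero_extend == true pads with zeros (unpack against an all-zero register);
// otherwise the high half of each lane is filled with the sign of the short.
inline void widen_epi16(__m128i v, bool zero_extend, int* out) {
    assert(out != nullptr);
    __m128i hi_bits = zero_extend
        ? _mm_setzero_si128()
        : _mm_cmpgt_epi16(_mm_setzero_si128(), v);  // 0xFFFF where v < 0
    __m128i lo = _mm_unpacklo_epi16(v, hi_bits);    // x0 x1 x2 x3 as 32-bit
    __m128i hi = _mm_unpackhi_epi16(v, hi_bits);    // x4 x5 x6 x7 as 32-bit
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), lo);
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out + 4), hi);
}
```

Which variant is appropriate depends on whether the 16 values are meant to be signed; for non-negative data the two give the same result.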

Smart_Lubobya · ‎08-12-2010

Thanks for the reply. I have done this:

__m128i f0,f1,f2,f3,temp;

__declspec(align(16))__m128i*b1 =(__m128i*)b;

temp = _mm_set_epi32(0,0,0,0);

f0 = _mm_load_si128(b1);//-114,-93,-25,36,-113,-95,-26,35

f1 = _mm_load_si128(b1+1);//-113,-90,-23,35,-112,-92,-25,38

f2 = f0;

f0 = _mm_unpacklo_epi16(f0,temp);//-114,0,-93,0,-25,0,36,0

f2 = _mm_unpackhi_epi16(f2,temp);//-113,0,-95,0,-26,0,35,0

f3=f1;

f1 = _mm_unpacklo_epi16(f1,temp);//-113,0,-90,0,-23,0,35,0

f3 = _mm_unpackhi_epi16(f3,temp);//-112,0,-92,0,-25,0,38,0

Notice: f1 is loaded as f1 = _mm_load_si128(b1+1).
1. After unpacklo and unpackhi there are still 8 elements (short) per register. Is there a way to make them 4?
2. Is there a way of putting, loading or setting the first elements of f0, f1, f2, f3 in one row, i.e. -114, -113,-112,-114, and say -93,-95,-90,-92 in another row? I want to add the two rows with _mm_add_epi16().

Brijender_B_Intel · ‎08-12-2010

1. You still have 8 shorts, but you now have 4 integers there, with the higher bits filled with zeros. These instructions picked the short values and extended them with zeros to make them 32-bit. Earlier you had no zero values in those places, and if you had read them as 32-bit integers your results would have been wrong.
2. You need shuffles or blends here. I think you can use the floating-point shuffle (shufps), which is more flexible than the integer one. There may be many clever ways to do it.

assuming f0 = x0 x1 x2 x3, f1 = x4 x5 x6 x7, f2 = x8 x9 x10 x11, f3 = x12 x13 x14 x15
temp0 = f0; // in case you need f0 and f1 later
temp1 = f1;
f0 = _mm_shuffle_ps(f0, f1, 0x44) to make x0 x1 x4 x5 (please check the control byte in the manual)
f2 = _mm_shuffle_ps(f2, f3, 0x44) to make x8 x9 x12 x13
temp2 = f2
_mm_shuffle_ps(f0, f2, 0x88) -> x0 x4 x8 x12
_mm_shuffle_ps(f0, temp2, 0xDD) -> x1 x5 x9 x13

You may have to use _mm_castsi128_ps() or a similar intrinsic to tell the compiler to treat f0, f1, f2, f3 etc. as floats in each of these instructions.
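
Put together, the shuffle sequence with the casts might look like the sketch below. It assumes each of f0..f3 already holds four 32-bit ints; `gather_column` is a hypothetical helper name, and the control bytes 0x44, 0x88 and 0xDD select lanes (0,1 | 0,1), (0,2 | 0,2) and (1,3 | 1,3) from the two inputs respectively.

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE: _mm_shuffle_ps
#include <emmintrin.h>  // SSE2: integer vectors and cast intrinsics

// Hypothetical helper: gather element `col` (0 or 1) of four int vectors
// into one vector: {f0[col], f1[col], f2[col], f3[col]}.
inline __m128i gather_column(__m128i f0, __m128i f1, __m128i f2, __m128i f3, int col) {
    assert(col == 0 || col == 1);
    // Reinterpret as float vectors; no bits change, only the type.
    __m128 a = _mm_castsi128_ps(f0);
    __m128 b = _mm_castsi128_ps(f1);
    __m128 c = _mm_castsi128_ps(f2);
    __m128 d = _mm_castsi128_ps(f3);
    __m128 ab = _mm_shuffle_ps(a, b, 0x44);  // f0[0] f0[1] f1[0] f1[1]
    __m128 cd = _mm_shuffle_ps(c, d, 0x44);  // f2[0] f2[1] f3[0] f3[1]
    __m128 r = (col == 0)
        ? _mm_shuffle_ps(ab, cd, 0x88)       // lanes 0 and 2 of each input
        : _mm_shuffle_ps(ab, cd, 0xDD);      // lanes 1 and 3 of each input
    return _mm_castps_si128(r);
}
```

The two gathered rows can then be added with `_mm_add_epi32`, since each lane is a 32-bit value at this point.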