Intel® ISA Extensions

Aligning 'btr' or its address using __declspec(align(16))

Smart_Lubobya
Beginner
573 Views

How can I align 'btr' or its address before doing the load and add? 'btr' can be an array of ints, floats or shorts. See the code snippet below:
// .h file
class tom
{
public:
    virtual void add(void* btr);
    .
    .
    .
};

// .cpp file
void tom::add(void* btr)
{
    int* b = (int *)btr;
    __m128i f0,f1,f2,f3,s0,s1;
    f0 = _mm_load_si128(b);
    f1 = _mm_load_si128(b+1);
    f2 = _mm_load_si128(b+2);
    f3 = _mm_load_si128(b+3);
    s0 = _mm_add_epi16(f0,f3);
    s1 = _mm_add_epi16(f1,f2);
}

0 Kudos
8 Replies
Brijender_B_Intel
If you are not sure whether btr is aligned, you can put a "C" loop or a scalar SSE loop before the vectorized loop to process the initial non-aligned elements. At maximum

Something like this:

while( (long long)btr & 15 ){
    scalar code or "C" code that processes one element and advances btr
}
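The prologue idea above can be sketched roughly as follows. This is only an illustration, not the poster's code: `sum_ints` is a hypothetical function, and it assumes the data is an array of `int` that is at least naturally (4-byte) aligned, so the scalar prologue runs at most 3 times before the address reaches a 16-byte boundary.

```cpp
#include <cstddef>
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical example: sums n ints starting at p.
// Scalar prologue until (p + i) is 16-byte aligned, aligned vector body,
// then a scalar epilogue for the leftover elements.
int sum_ints(const int* p, std::size_t n) {
    int total = 0;
    std::size_t i = 0;

    // Prologue: handle elements one by one until the address is aligned.
    while (i < n && (reinterpret_cast<std::uintptr_t>(p + i) & 15) != 0) {
        total += p[i];
        ++i;
    }

    // Vector body: p + i is now 16-byte aligned, so aligned loads are safe.
    __m128i acc = _mm_setzero_si128();
    for (; i + 4 <= n; i += 4) {
        acc = _mm_add_epi32(
            acc, _mm_load_si128(reinterpret_cast<const __m128i*>(p + i)));
    }
    alignas(16) int lanes[4];
    _mm_store_si128(reinterpret_cast<__m128i*>(lanes), acc);
    total += lanes[0] + lanes[1] + lanes[2] + lanes[3];

    // Epilogue: fewer than 4 elements remain.
    for (; i < n; ++i) total += p[i];
    return total;
}
```

Note that the test is made on the pointer value (`p + i`), not on the address of the pointer variable.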


Or you can over-allocate and derive an aligned pointer as below:
int *btrA = (int*)malloc(size+16);

btr = (int*)(((long long)btrA + 15) & ~15LL);

or use the __declspec(align(16)) directive on the declaration of the array itself.
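A minimal sketch of the over-allocate-and-round-up approach, assuming a plain `std::malloc` buffer. The names `align_up_16`, `is_aligned_16`, `AlignedInts` and `alloc_aligned_ints` are hypothetical helpers made up for this illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Hypothetical helper: rounds an address up to the next multiple of 16.
inline void* align_up_16(void* raw) {
    const std::uintptr_t a = reinterpret_cast<std::uintptr_t>(raw);
    return reinterpret_cast<void*>((a + 15) & ~static_cast<std::uintptr_t>(15));
}

// Hypothetical helper: true if p sits on a 16-byte boundary.
inline bool is_aligned_16(const void* p) {
    return (reinterpret_cast<std::uintptr_t>(p) & 15) == 0;
}

// Keeps both pointers: 'raw' must be the one passed to std::free.
struct AlignedInts {
    void* raw;   // what malloc returned
    int*  data;  // 16-byte aligned view into the same allocation
};

// Allocates room for 'count' ints plus 15 bytes of slack for the round-up.
inline AlignedInts alloc_aligned_ints(std::size_t count) {
    void* raw = std::malloc(count * sizeof(int) + 15);
    int* data = raw ? static_cast<int*>(align_up_16(raw)) : nullptr;
    return AlignedInts{raw, data};
}
```

The original pointer has to be kept so the block can be freed later. Many compilers also ship `_mm_malloc` and `_mm_free`, which take the alignment as a parameter and avoid the manual arithmetic.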

jeyonpolardhotmail_c
I think it's a scripting language. Am I right? Tell me.


Smart_Lubobya
Beginner
I tried aligning like this: __declspec(align(16)) int* b = (int *)btr; but when debugging, the load failed. Are my b, b+1, b+2, b+3 arrays well aligned? Where is the bug? The code compiles fine but fails to read values f0 to f3. Note that 'btr' can be short or int.

class tom
{
public:
    virtual void add(void* btr);
    .
    .
    .
};

// .cpp file
void tom::add(void* btr)
{
    __declspec(align(16)) int* b = (int *)btr;
    __m128i f0,f1,f2,f3,s0,s1;
    f0 = _mm_load_si128(b);
    f1 = _mm_load_si128(b+1);
    f2 = _mm_load_si128(b+2);
    f3 = _mm_load_si128(b+3);
    s0 = _mm_add_epi16(f0,f3);
    s1 = _mm_add_epi16(f1,f2);
}
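One point worth illustrating here: an alignment attribute written on a pointer declaration aligns the pointer variable itself, not the memory it points to. The sketch below shows two alternatives under that reading: aligning the storage where it is declared, and falling back to an unaligned load when the address is not a multiple of 16. `Block4x4` and `load_16_bytes` are hypothetical names, and standard `alignas(16)` is used in place of `__declspec(align(16))` for portability.

```cpp
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical helper: loads 16 bytes from p, using the aligned load only
// when the address really is 16-byte aligned.
inline __m128i load_16_bytes(const void* p) {
    const __m128i* v = static_cast<const __m128i*>(p);
    if ((reinterpret_cast<std::uintptr_t>(p) & 15) == 0) {
        return _mm_load_si128(v);   // would fault on a misaligned address
    }
    return _mm_loadu_si128(v);      // valid for any address
}

// Hypothetical 4x4 block of shorts whose storage itself is 16-byte aligned.
struct Block4x4 {
    alignas(16) short v[16];
};
```

With the storage aligned this way, `_mm_load_si128` can be used directly on `v` and on `v + 8`, which are the two 16-byte halves of the block.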

Brijender_B_Intel
Can you please tell me how "btr" is defined?
Secondly, are you trying to load like this (assume elements are x0, x1, x2, x3, ............)

f0 = x0 x1 x2 x3
f1 = x1 x2 x3 x4
f2 = x2 x3 x4 x5

This is what the current code does. If that is what you want, you don't need to load all 4 registers (f0, f1, f2, f3). You can load f0 and f3, then shuffle these to get f1 and f2. It will save you two unaligned loads.

Or do you want this:

f0 = x0 x1 x2 x3
f1 = x4 x5 x6 x7
. .. . . .. .
If this is what you want, then you need to use (b+4) for the second load, (b+8) for the third load, etc.
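The difference between the two layouts comes down to pointer arithmetic: adding 1 to an `int*` advances by one element (4 bytes on typical x86 compilers), while adding 1 to a `__m128i*` advances by a whole 16-byte register. A small sketch, with the hypothetical helpers `step_in_bytes`, `load_at_element` and `load_row`:

```cpp
#include <cstddef>
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical helper: byte distance between p and p + 1.
template <typename T>
std::ptrdiff_t step_in_bytes(const T* p) {
    return reinterpret_cast<const char*>(p + 1) -
           reinterpret_cast<const char*>(p);
}

// Overlapping window: starts at element i, so i = 1 yields x1 x2 x3 x4.
// The address is generally not 16-byte aligned, hence the unaligned load.
inline __m128i load_at_element(const int* base, int i) {
    return _mm_loadu_si128(reinterpret_cast<const __m128i*>(base + i));
}

// Row r of four ints: r = 1 yields x4 x5 x6 x7.
// Assumes base is 16-byte aligned, so every row is aligned too.
inline __m128i load_row(const int* base, int r) {
    return _mm_load_si128(reinterpret_cast<const __m128i*>(base) + r);
}
```

Casting first and then adding `r` is equivalent to adding `4 * r` to the `int*` before the cast.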
Smart_Lubobya
Beginner
Thanks for the reply. btr is a 4x4 2-D array of short. I want to load the array as ints, four per register, like

f0 = x0 x1 x2 x3
f1 = x4 x5 x6 x7
I have tried your earlier advice, but only f0 and f1 are loaded. Further advice is very much needed.

Nasile.h

class Nasile
{
public:
    virtual void add(void* ptr) = 0;
};

tom.h

#include "Nasile.h"

class tom : public Nasile
{
public:
    virtual void add(void* ptr);
};

Tom.cpp

void tom::add(void* ptr)
{
    __declspec(align(16)) short* b = (short*)ptr;
    __m128i s0,s1,f0,f1,f2,f3;
    f0 = _mm_load_si128((__m128i*)b);
    f1 = _mm_load_si128((__m128i*)b+4);
    f2 = _mm_load_si128((__m128i*)b+8);
    f3 = _mm_load_si128((__m128i*)b+12);
    s0 = _mm_add_epi32(f0,f3);
    s1 = _mm_add_epi32(f1,f2);
}

Charles.h

#include "Nasile.h"

{
public:
    virtual void Quant (Nasile * pQ) { pQ->add(_pBlk); }
};

Charles.cpp

#include "Charles.h"

_pBlk = new short[16];

Brijender_B_Intel
So you have 16 elements of type short. A short is 2 bytes, so in total you have 32 bytes to load (16 x 2). One XMM register holds 16 bytes, so everything is loaded by two load instructions. Your third load will fault because you are going out of range.

So your first register is f0 = x0 x1 x2 x3 x4 x5 x6 x7 (8 packed elements), and similarly for the second one.

If you want to convert short (2 bytes) to int (4 bytes) before processing, you need the unpack instructions punpcklwd and punpckhwd (please check the IA-32/64 programming reference manual, volume 2B).
It will be something like this:

__m128i temp = _mm_set_epi32(0, 0, 0, 0);
load f0 and f1.

f2 = f0;
f0 = _mm_unpacklo_epi16(f0, temp); // puts x0 x1 x2 x3 in f0, each padded with 16 zero bits
f2 = _mm_unpackhi_epi16(f2, temp); // puts the high elements x4 x5 x6 x7 in f2
Do the same thing with f1 and f3.
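Interleaving with a zero register zero-extends each short, which is only correct for non-negative values: a short holding -114 would read back as 65422 when viewed as a 32-bit int. If the data is signed, one common SSE2 alternative (a swapped-in variant, not what the reply above describes) is to interleave with a per-lane sign mask instead. `widen_epi16_to_epi32` is a hypothetical helper name.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical helper: widens 8 signed shorts in v into two registers of
// 4 signed 32-bit ints each, preserving the sign of every element.
inline void widen_epi16_to_epi32(__m128i v, __m128i& lo, __m128i& hi) {
    // Each 16-bit lane becomes 0xFFFF where v is negative, 0x0000 otherwise.
    const __m128i sign = _mm_cmpgt_epi16(_mm_setzero_si128(), v);
    lo = _mm_unpacklo_epi16(v, sign);  // x0 x1 x2 x3 as 32-bit ints
    hi = _mm_unpackhi_epi16(v, sign);  // x4 x5 x6 x7 as 32-bit ints
}
```

After widening, the elements are 32 bits wide, so they should be added with `_mm_add_epi32` rather than `_mm_add_epi16`.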
Smart_Lubobya
Beginner
Thanks for the reply. I have done this:

__m128i f0,f1,f2,f3,temp;
__declspec(align(16)) __m128i* b1 = (__m128i*)b;
temp = _mm_set_epi32(0,0,0,0);
f0 = _mm_load_si128(b1);   //-114,-93,-25,36,-113,-95,-26,35
f1 = _mm_load_si128(b1+1); //-113,-90,-23,35,-112,-92,-25,38
f2 = f0;
f0 = _mm_unpacklo_epi16(f0,temp); //-114,0,-93,0,-25,0,36,0
f2 = _mm_unpackhi_epi16(f2,temp); //-113,0,-95,0,-26,0,35,0
f3 = f1;
f1 = _mm_unpacklo_epi16(f1,temp); //-113,0,-90,0,-23,0,35,0
f3 = _mm_unpackhi_epi16(f3,temp); //-112,0,-92,0,-25,0,38,0

Note: f1 is loaded as f1 = _mm_load_si128(b1+1).
1. unpacklo and unpackhi still leave 8 (short) elements. Is there a way to make them 4?
2. Is there a way of putting, loading or setting the first elements of f0, f1, f2, f3 in one row, i.e. -114, -113, -112, -114, and say -93, -95, -90, -92 in another row? I want to add the two rows with _mm_add_epi16().

Brijender_B_Intel
1. You still have 8 shorts, but you now have 4 integers there with the upper bits filled with zeros. These instructions took each short value and extended it with zeros to make it 32 bits. Before, those positions did not hold zeros, so reading them as 32-bit integers would have given wrong results.
2. You need shuffles or blends here. I think you can use the floating-point shuffle (shufps), which is more flexible than the integer one. There may be many clever ways to do it.

Assuming f0 = x0 x1 x2 x3, f1 = x4 x5 x6 x7, f2 = x8 x9 x10 x11, f3 = x12 x13 x14 x15:
temp0 = f0; // in case you need f0 and f2 later
temp1 = f2;
f0 = _mm_shuffle_ps(f0, f1, 0x44) to make x0 x1 x4 x5 (please check the control byte in the manual)
f2 = _mm_shuffle_ps(f2, f3, 0x44) to make x8 x9 x12 x13
_mm_shuffle_ps(f0, f2, 0x88) -> x0 x4 x8 x12
_mm_shuffle_ps(f0, f2, 0xDD) -> x1 x5 x9 x13

You may have to use _mm_castsi128_ps() or a similar intrinsic to tell the compiler to treat f0, f1, f2, f3, etc. as floats in each of these instructions.
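The shuffle sequence can be sketched with the casts written out. This assumes the four registers already hold 32-bit elements (f0 = x0 x1 x2 x3 through f3 = x12 x13 x14 x15); `first_two_columns` is a hypothetical function name. The casts only reinterpret the bits for the compiler and do not convert values.

```cpp
#include <xmmintrin.h>  // SSE: _mm_shuffle_ps
#include <emmintrin.h>  // SSE2: __m128i and the cast intrinsics

// Hypothetical helper: gathers the first and second element of each row.
//   col0 = x0 x4 x8 x12,  col1 = x1 x5 x9 x13
inline void first_two_columns(__m128i f0, __m128i f1, __m128i f2, __m128i f3,
                              __m128i& col0, __m128i& col1) {
    // 0x44 selects elements 0 and 1 of each operand.
    const __m128 a = _mm_shuffle_ps(_mm_castsi128_ps(f0),
                                    _mm_castsi128_ps(f1), 0x44);  // x0 x1 x4 x5
    const __m128 b = _mm_shuffle_ps(_mm_castsi128_ps(f2),
                                    _mm_castsi128_ps(f3), 0x44);  // x8 x9 x12 x13
    // 0x88 selects elements 0 and 2, 0xDD selects elements 1 and 3.
    col0 = _mm_castps_si128(_mm_shuffle_ps(a, b, 0x88));  // x0 x4 x8 x12
    col1 = _mm_castps_si128(_mm_shuffle_ps(a, b, 0xDD));  // x1 x5 x9 x13
}
```

The two results can then be added lane by lane, for example with `_mm_add_epi32(col0, col1)` when the elements are 32-bit ints.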