How can I align 'btr', or its address, before doing the load and add? 'btr' can point to an array of integers, floats or shorts. See the code snippet below:

// .h file

class tom
{
public:
    virtual void add(void* btr);
    // ...
};

// .cpp file

void tom::add(void* btr)
{
    int* b = (int*)btr;
    __m128i f0, f1, f2, f3, s0, s1;

    f0 = _mm_load_si128((__m128i*)b);
    f1 = _mm_load_si128((__m128i*)(b+1));
    f2 = _mm_load_si128((__m128i*)(b+2));
    f3 = _mm_load_si128((__m128i*)(b+3));

    s0 = _mm_add_epi16(f0, f3);
    s1 = _mm_add_epi16(f1, f2);
}

Something like this:

while ((long long)btr & 15) {
    // scalar ("C") code, one element at a time, until btr reaches a 16-byte boundary
}

Or you can derive an aligned pointer from an over-allocated buffer, as below:

int *btrA = (int*)malloc(size + 16);

btr = (int*)(((long long)btrA + 15) & ~15);

Or use the __declspec(align(16)) directive.
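
The options above can be sketched as follows. This is only a minimal sketch, assuming SSE2 and a C++11 compiler. The helper names (is_aligned16, align_up16, load128) are made up for illustration, and std::uintptr_t stands in for the long long cast because it is the integer type intended for holding a pointer value.

```cpp
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

// True when p sits on a 16-byte boundary.
inline bool is_aligned16(const void* p) {
    return (reinterpret_cast<std::uintptr_t>(p) & 15u) == 0;
}

// Round p up to the next 16-byte boundary (returns p itself if already aligned).
// The caller must have over-allocated the buffer by at least 15 bytes.
inline void* align_up16(void* p) {
    const std::uintptr_t a = reinterpret_cast<std::uintptr_t>(p);
    return reinterpret_cast<void*>((a + 15u) & ~static_cast<std::uintptr_t>(15));
}

// Load 16 bytes from p: aligned load when possible, unaligned load otherwise.
inline __m128i load128(const void* p) {
    const __m128i* v = static_cast<const __m128i*>(p);
    return is_aligned16(p) ? _mm_load_si128(v) : _mm_loadu_si128(v);
}
```

When add() only receives a void* from a caller it does not control, the fallback to _mm_loadu_si128 keeps the code correct for any address. Keep the original (unadjusted) pointer around if you use align_up16 on a malloc'ed buffer, since that is the one you have to pass to free().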

class tom
{
public:
    virtual void add(void* btr);
    // ...
};

// .cpp file

void tom::add(void* btr)
{
    __declspec(align(16)) int* b = (int*)btr;
    __m128i f0, f1, f2, f3, s0, s1;

    f0 = _mm_load_si128((__m128i*)b);
    f1 = _mm_load_si128((__m128i*)(b+1));
    f2 = _mm_load_si128((__m128i*)(b+2));
    f3 = _mm_load_si128((__m128i*)(b+3));

    s0 = _mm_add_epi16(f0, f3);
    s1 = _mm_add_epi16(f1, f2);
}

Secondly, are you trying to load like this (assume the elements are x0, x1, x2, x3, ...)?

f0 = x0 x1 x2 x3

f1 = x1 x2 x3 x4

f2 = x2 x3 x4 x5

This is what the current code does. If that is what you want, you don't need to load all four registers (f0, f1, f2, f3). You can load f0 and f3, then shuffle those two to get f1 and f2. It will save you two unaligned loads.

Or do you want this?

f0 = x0 x1 x2 x3

f1 = x4 x5 x6 x7

...

If this is what you want, then you need b+4 for the second load, b+8 for the third load, and so on.
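
Both readings can be sketched in a few lines. This is a hedged illustration, assuming SSE2 and 32-bit int elements. load_block and middle_windows are hypothetical helper names, and the unaligned load is used so the sketch works for any address.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics (also provides _mm_shuffle_ps and _MM_SHUFFLE)

// Consecutive blocks: block k starts at element 4*k of the int array.
// Arithmetic on an int* counts ints, so the stride between blocks is 4.
// (On a __m128i* the same step would be +1, since it counts whole vectors.)
inline __m128i load_block(const int* b, int k) {
    return _mm_loadu_si128(reinterpret_cast<const __m128i*>(b + 4 * k));
}

// Overlapping windows: given f0 = x0 x1 x2 x3 and f3 = x3 x4 x5 x6,
// build f1 = x1 x2 x3 x4 and f2 = x2 x3 x4 x5 without two more loads.
inline void middle_windows(__m128i f0, __m128i f3, __m128i* f1, __m128i* f2) {
    const __m128 a = _mm_castsi128_ps(f0);
    const __m128 d = _mm_castsi128_ps(f3);
    // shufps takes result lanes 0,1 from the first operand and lanes 2,3 from the second.
    *f1 = _mm_castps_si128(_mm_shuffle_ps(a, d, _MM_SHUFFLE(1, 0, 2, 1)));  // x1 x2 | x3 x4
    *f2 = _mm_castps_si128(_mm_shuffle_ps(a, d, _MM_SHUFFLE(2, 1, 3, 2)));  // x2 x3 | x4 x5
}
```

The casts only reinterpret the register contents, so the integer bit patterns pass through the float shuffle unchanged.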

f0 = x0 x1 x2 x3

f1 = x4 x5 x6 x7

I have tried your earlier advice, but only f0 and f1 are loaded. Further advice is very much needed.

Nasile.h

class Nasile
{
public:
    virtual void add(void* ptr) = 0;
};

tom.h

#include "Nasile.h"

class tom : public Nasile
{
public:
    virtual void add(void* ptr);
};

Tom.cpp

void tom::add(void* ptr)
{
    __declspec(align(16)) short* b = (short*)ptr;
    __m128i s0, s1, f0, f1, f2, f3;

    f0 = _mm_load_si128((__m128i*)b);
    f1 = _mm_load_si128((__m128i*)b+4);
    f2 = _mm_load_si128((__m128i*)b+8);
    f3 = _mm_load_si128((__m128i*)b+12);

    s0 = _mm_add_epi32(f0, f3);
    s1 = _mm_add_epi32(f1, f2);
}

Charles.h

#include "Nasile.h"

class Charles
{
public:
    virtual void Quant(Nasile* pQ) { pQ->add(_pBlk); }
};

Charles.cpp

#include "Charles.h"

_pBlk = new short[16];

So your first register is f0 = x0 x1 x2 x3 x4 x5 x6 x7 (eight packed 16-bit elements), and similarly for the second one.

If you want to convert short (2 bytes) to int (4 bytes) before processing, you need to use the unpack instructions punpcklwd and punpckhwd (please check the IA-32/64 programming reference manual, volume 2B).

It will be something like this:

__m128i temp = _mm_set_epi32(0, 0, 0, 0);

load f0 and f1.

f2 = f0;

f0 = _mm_unpacklo_epi16(f0, temp); // puts x0 x1 x2 x3 in f0, each element padded with 16 zero bits

f2 = _mm_unpackhi_epi16(f2, temp); // puts the high elements x4 x5 x6 x7 in f2

Do the same thing with f1, using f3 for its high half.
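
One caveat worth spelling out: padding with zeros is a zero extension, so it only gives the right 32-bit value for non-negative data. A hedged sketch of both variants follows, assuming SSE2. widen_s16 and widen_u16 are hypothetical helper names, and the sign-extending trick (interleave the register with itself, then shift right arithmetically) is a common SSE2 idiom rather than something taken from this thread.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics

// Widen eight signed 16-bit values into two vectors of four 32-bit ints.
inline void widen_s16(__m128i v, __m128i* lo, __m128i* hi) {
    // Interleaving v with itself places each short in the upper half of a
    // 32-bit lane; an arithmetic right shift by 16 then sign-extends it.
    *lo = _mm_srai_epi32(_mm_unpacklo_epi16(v, v), 16);  // x0 x1 x2 x3
    *hi = _mm_srai_epi32(_mm_unpackhi_epi16(v, v), 16);  // x4 x5 x6 x7
}

// Zero-extending variant (interleave with zero): correct for unsigned data only.
inline void widen_u16(__m128i v, __m128i* lo, __m128i* hi) {
    const __m128i zero = _mm_setzero_si128();
    *lo = _mm_unpacklo_epi16(v, zero);
    *hi = _mm_unpackhi_epi16(v, zero);
}
```

After widening, each register holds four 32-bit lanes, so the follow-up arithmetic should use the epi32 intrinsics such as _mm_add_epi32. A debugger that displays the register as eight shorts will still show eight numbers, but every second one is just the upper half of a 32-bit lane.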

__m128i f0,f1,f2,f3,temp;

__declspec(align(16))__m128i*b1 =(__m128i*)b;

temp = _mm_set_epi32(0,0,0,0);

f0 = _mm_load_si128(b1);//-114,-93,-25,36,-113,-95,-26,35

f1 = _mm_load_si128(b1+1);//-113,-90,-23,35,-112,-92,-25,38

f2 = f0;

f0 = _mm_unpacklo_epi16(f0,temp);//-114,0,-93,0,-25,0,36,0

f2 = _mm_unpackhi_epi16(f2,temp);//-113,0,-95,0,-26,0,35,0

f3=f1;

f1 = _mm_unpacklo_epi16(f1,temp);//-113,0,-90,0,-23,0,35,0

f3 = _mm_unpackhi_epi16(f3,temp);//-112,0,-92,0,-25,0,38,0

Notice: f1 is loaded as f1 = _mm_load_si128(b1+1).

1. unpacklo and unpackhi still leave 8 (short) elements in each register. Is there a way of making that 4?

2. Is there a way of putting, loading or setting the first elements of f0, f1, f2, f3 in one row, i.e. -114, -113, -112, -114, and say -93, -95, -90, -92 in another row? I want to add the two rows with _mm_add_epi16().

2. You need shuffles or blends here. I think you can use the floating-point shuffle (shufps), which is more flexible than the integer one. There may be many clever ways to do it.

Assuming f0 = x0 x1 x2 x3, f1 = x4 x5 x6 x7, f2 = x8 x9 x10 x11, f3 = x12 x13 x14 x15:

temp0 = f0; // in case you need the original f0 and f2 later

temp1 = f2;

f0 = _mm_shuffle_ps(f0, f1, 0x44) to make x0 x1 x4 x5 (please check the control byte in the manual)

f2 = _mm_shuffle_ps(f2, f3, 0x44) to make x8 x9 x12 x13

_mm_shuffle_ps(f0, f2, 0x88) -> x0 x4 x8 x12

_mm_shuffle_ps(f0, f2, 0xDD) -> x1 x5 x9 x13

You may have to use _mm_castsi128_ps() or a similar intrinsic to tell the compiler to treat f0, f1, f2, f3 etc. as floats in each of these instructions.
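
Putting the shuffle idea together with the casts, a full 4x4 rearrangement could look like the sketch below. It assumes SSE2 and that each register already holds four 32-bit elements. transpose4x4_epi32 is a hypothetical helper name. The control bytes are 0x44 (lanes 0,1 of each operand), 0xEE (lanes 2,3), 0x88 (lanes 0,2) and 0xDD (lanes 1,3).

```cpp
#include <emmintrin.h>  // SSE2 intrinsics (also provides _mm_shuffle_ps)

// Input rows:  f0 = x0 x1 x2 x3, f1 = x4 x5 x6 x7,
//              f2 = x8 x9 x10 x11, f3 = x12 x13 x14 x15.
// Output rows: out[0] = x0 x4 x8 x12, out[1] = x1 x5 x9 x13, and so on.
inline void transpose4x4_epi32(__m128i f0, __m128i f1, __m128i f2, __m128i f3,
                               __m128i out[4]) {
    // Reinterpret as floats so shufps can be used; no bits change.
    const __m128 a = _mm_castsi128_ps(f0);
    const __m128 b = _mm_castsi128_ps(f1);
    const __m128 c = _mm_castsi128_ps(f2);
    const __m128 d = _mm_castsi128_ps(f3);

    const __m128 t0 = _mm_shuffle_ps(a, b, 0x44);  // x0  x1  x4  x5
    const __m128 t1 = _mm_shuffle_ps(c, d, 0x44);  // x8  x9  x12 x13
    const __m128 t2 = _mm_shuffle_ps(a, b, 0xEE);  // x2  x3  x6  x7
    const __m128 t3 = _mm_shuffle_ps(c, d, 0xEE);  // x10 x11 x14 x15

    out[0] = _mm_castps_si128(_mm_shuffle_ps(t0, t1, 0x88));  // x0 x4 x8  x12
    out[1] = _mm_castps_si128(_mm_shuffle_ps(t0, t1, 0xDD));  // x1 x5 x9  x13
    out[2] = _mm_castps_si128(_mm_shuffle_ps(t2, t3, 0x88));  // x2 x6 x10 x14
    out[3] = _mm_castps_si128(_mm_shuffle_ps(t2, t3, 0xDD));  // x3 x7 x11 x15
}
```

With the rows in this layout, adding out[0] and out[1] with _mm_add_epi32 sums the first and second elements of the four original registers lane by lane.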
