Improving for-loop performance

Smart_Lubobya · ‎08-25-2010

How can I improve the performance of this loop? Is there a better C++ equivalent?

.cpp file

void tom::add(void* ptr)
{
    short* b = (short*)ptr;
    int j;

    for (j = 0; j < 16; j += 4)
    {
        /// 1st stage add
        int c0 = (int)(b[j] + b[j+3]);
        int c3 = (int)(b[j] - b[j+3]);
        int c1 = (int)(b[j+1] + b[j+2]);
        int c2 = (int)(b[j+1] - b[j+2]);

        /// 2nd stage add
        b[j]   = (short)(c0 + c1);
        b[j+2] = (short)(c0 - c1);
        b[j+1] = (short)(c2 + (c3 << 1));
        b[j+3] = (short)(c3 - (c2 << 1));
    }
}

jimdempseyatthecove · ‎08-25-2010

Is your code correct?

in the

int c? = (int)(b[?]op b[?]);

the b[?] op b[?] may exceed the limitations of short

Does the code require

int c? = (int)b[?]op (int)b[?];

Jim Dempsey
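On the overflow question: in C++ both short operands are promoted to int before + or - is evaluated, so the first-stage sums are already computed at int width and only the cast back to short in the second stage can lose bits. A minimal sketch, assuming int is wider than short (true for the compilers discussed here); both helper names are made up for illustration:

```cpp
// Hypothetical helper (not from the thread): a first-stage style sum.
int sum_as_int(short a, short b)
{
    // a and b are each promoted to int before '+', so the addition is
    // carried out at int width without an explicit cast on the operands.
    return a + b;
}

// Hypothetical helper: the narrowing step the second stage performs.
short narrow_to_short(int v)
{
    // Values outside the range of short are not preserved by this cast.
    return (short)v;
}
```

For example, sum_as_int(32767, 32767) gives 65534, which would not fit in a short.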

Smart_Lubobya · ‎08-26-2010

Yes, the first-stage values are declared as int. The code works, but it is slow. I guess the bottleneck is the loop, hence my question: is there a way to optimise this loop to make it faster than it is now?

Milind_Kulkarni__Int · ‎08-27-2010

Are you using the auto-vectorization feature of the Intel compilers? The default is -O2, so I think you could get some vectorization messages.

Also, you can look in the compiler User Guide, in the section "Using Parallelism: Auto-Vectorization", to map the option to your processor type. There are also some guidelines there on things that prevent vectorization, e.g. mixed data types, which your program uses in the loop (and which is unsupported). I am not sure how changing that would affect your application's needs.

To enable vectorization, you could also split the loop:

Put the 1st stage and the 2nd stage in different loop bodies, to avoid dependencies and a large loop body, i.e. by fissioning the loop. Since that would otherwise alter the semantics and the results, it can be done by using a separate array c[16] instead of the scalars cn (c0, c1, etc.).

I am not sure about any algorithmic change that could be made to increase performance.

Also, since the loop is short (low trip count), it may already be efficient, so you would really have to test the numbers and experiment at run time with test data.

For vectorization to occur, you have to do away with the mixed data types.

Lastly, I am not an expert on vectorization; I only hoped to help you look past the program's algorithm and into the compiler features.

Your vectorizable program would look like the one below (with the mixture of shorts and ints done away with):

[bash]void add(void* ptr)
{
    // Assumes ptr now points to 16 ints rather than 16 shorts.
    int* b = (int*)ptr;
    int i, j, c[16];
#pragma vector always
    for (i = 0; i < 4; i++)
    {
        j = 4*i;
        /// 1st stage add
        c[j]   = (b[j] + b[j+3]);
        c[j+3] = (b[j] - b[j+3]);
        c[j+1] = (b[j+1] + b[j+2]);
        c[j+2] = (b[j+1] - b[j+2]);
    }
#pragma vector always
    for (i = 0; i < 4; i++)
    {
        j = 4*i;
        /// 2nd stage add
        b[j]   = (c[j] + c[j+1]);
        b[j+2] = (c[j] - c[j+1]);
        b[j+1] = (c[j+2] + (c[j+3] << 1));
        b[j+3] = (c[j+3] - (c[j+2] << 1));
    }
}
[/bash]

Please let me know whether that does any justice to your program.
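If the buffer has to remain an array of short, the same two-pass idea can be sketched by widening into a temporary int array c[16] and narrowing again on the way out. This is only an illustration under that assumption, not a measured improvement; the names add_ref and add_two_pass are hypothetical, and * 2 is used in place of << 1 to avoid shifting negative values.

```cpp
// Reference: the original single-loop form, operating directly on shorts.
void add_ref(short* b)
{
    for (int j = 0; j < 16; j += 4) {
        int c0 = b[j]   + b[j+3];
        int c3 = b[j]   - b[j+3];
        int c1 = b[j+1] + b[j+2];
        int c2 = b[j+1] - b[j+2];
        b[j]   = (short)(c0 + c1);
        b[j+2] = (short)(c0 - c1);
        b[j+1] = (short)(c2 + c3 * 2);
        b[j+3] = (short)(c3 - c2 * 2);
    }
}

// Fissioned variant: stage 1 fills c[16], stage 2 writes back to b.
void add_two_pass(short* b)
{
    int c[16];
    for (int i = 0; i < 4; ++i) {
        const int j = 4 * i;
        c[j]   = b[j]   + b[j+3];   // c0
        c[j+3] = b[j]   - b[j+3];   // c3
        c[j+1] = b[j+1] + b[j+2];   // c1
        c[j+2] = b[j+1] - b[j+2];   // c2
    }
    for (int i = 0; i < 4; ++i) {
        const int j = 4 * i;
        b[j]   = (short)(c[j]   + c[j+1]);
        b[j+2] = (short)(c[j]   - c[j+1]);
        b[j+1] = (short)(c[j+2] + c[j+3] * 2);
        b[j+3] = (short)(c[j+3] - c[j+2] * 2);
    }
}
```

Both functions should produce identical output for the same input, which makes the single-loop form a convenient check when experimenting with the split version.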

jimdempseyatthecove · ‎08-27-2010

Try something like the following untested code

[bash]__int64* b64 = (__int64*)ptr;
for(int i=0; i < 4; ++i)
{
	__int64	t0 = b64[i];
	// t0 = {b[i*4+0],b[i*4+1],b[i*4+2],b[i*4+3]}
	__int64 t1 = _lrotl(t0, 16) &0x0000FFFF0000FFFFL;
	// t1 = {0x0000,b[i*4+2],0x0000,b[i*4+0]}
	t0 &= 0x0000FFFF0000FFFFL;
	// t0 = {0x0000,b[i*4+1],0x0000,b[i*4+3]}
	__int64 c10 = t0 + t1;
	// c10 = { b[i*4+1]+b[i*4+2], b[i*4+3]+b[i*4+0]}
	// c10 = {         c1,                c0 }
	__int64 c23 = t0 - t1;
	// c23 = { b[i*4+1]-b[i*4+2], b[i*4+3]-b[i*4+0]}
	// c23 = {       c2,                 -c3 }
	((short*)&b64[i])[0] = (short)(_lrotl(c10,32) + c10);
	//                                c0 + c1
	((short*)&b64[i])[1] = (short)(_lrotl(c10,32) - c10);
	//                                c0 - c1
	((short*)&b64[i])[2] = (short)(c23 - _lrotl(c23, 33));
	//                              c2 + (c3<<1)
	((short*)&b64[i])[3] = (short)(-_lrotl(c23,32) - (c23<<1));
	//                              c3 - (c2<<1)
}
[/bash]

Let us know the results.

Jim Dempsey

Smart_Lubobya · ‎08-28-2010

I tried the code:
1) I got the warning message 'warning C4068: unknown pragma' on #pragma vector always.
2) When I commented out the #pragma vector always, ptr could not be converted into int,
i.e. int* b = (int*)ptr; could not show any array values during debugging. How do I proceed?

Note: I am using MS Visual Studio 2008.
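Two things seem to be going on here. The pragma is Intel-specific, so other compilers warn about it, and casting a buffer of 16 shorts to int* reinterprets the same bytes as 8 ints instead of converting the values. A minimal sketch of one way around both, assuming the caller keeps its short buffer: guard the pragma with the Intel compiler's __INTEL_COMPILER macro and copy the values into an int array element by element. The helper names widen and narrow are hypothetical.

```cpp
#include <cstddef>

// Hypothetical helper: convert n shorts to ints, value by value.
void widen(const short* src, int* dst, std::size_t n)
{
#ifdef __INTEL_COMPILER
#pragma vector always   // Intel-only hint; hidden from other compilers
#endif
    for (std::size_t k = 0; k < n; ++k)
        dst[k] = src[k];
}

// Hypothetical helper: convert n ints back to shorts (may truncate).
void narrow(const int* src, short* dst, std::size_t n)
{
    for (std::size_t k = 0; k < n; ++k)
        dst[k] = (short)src[k];
}
```

With this, the int-based routine could be called on the widened copy and the result narrowed back afterwards, at the cost of two extra passes over the data.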

Smart_Lubobya · ‎08-28-2010

After trying this code, I got 5 errors: identifier _lrotl could not be found,
on all five lines where _lrotl is used.

I am using the MSVC 2008 compiler.

Milind_Kulkarni__Int · ‎08-28-2010

I thought you were using the Intel compiler, which recognizes these pragmas. You seem to be using the VS2008 compiler. So, to avoid complicating things further, you can use the example provided by Jim.

jimdempseyatthecove · ‎08-29-2010

Look in your MSVS 2008 documentation for the required headers.
In MSVC 2005, _lrotl requires stdlib.h.

Also see if MS changed it to a double underbar, __lrotl.

Jim
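One caveat worth checking: under MSVC, _lrotl takes an unsigned long, which is 32 bits wide there, so it may not rotate a full 64-bit value the way the packed code intends; _rotl64 from stdlib.h is the 64-bit counterpart. Where neither is available, a portable rotate can be sketched in plain C++. The name rotl64 below is hypothetical and this is only an illustration, not a drop-in fix for the untested code.

```cpp
// Hypothetical portable helper: rotate a 64-bit value left by s bits.
unsigned long long rotl64(unsigned long long v, unsigned s)
{
    s &= 63u;                       // keep the shift count in [0, 63]
    if (s == 0)
        return v;                   // avoid an undefined shift by 64
    return (v << s) | (v >> (64u - s));
}
```

For instance, rotl64(0x0001000200030004ULL, 16) yields 0x0002000300040001ULL, i.e. the four 16-bit lanes rotated by one position.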

Om_S_Intel · ‎08-30-2010

The loop count is too small for vectorization to yield an optimization benefit.
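Since the loop is this small, the practical way to settle it is to time each variant over many calls. A rough sketch using std::clock, which is available in older toolchains as well; the function name time_calls and the function-pointer signature are assumptions for illustration only.

```cpp
#include <ctime>

// Hypothetical harness: call fn(buf) reps times and return elapsed
// processor time in seconds as reported by std::clock.
double time_calls(void (*fn)(short*), short* buf, long reps)
{
    const std::clock_t start = std::clock();
    for (long r = 0; r < reps; ++r)
        fn(buf);                    // the routine under test
    const std::clock_t stop = std::clock();
    return double(stop - start) / CLOCKS_PER_SEC;
}
```

The repetition count needs to be large (millions of calls) for a 16-element routine to register at all, and the buffer is transformed in place on every call, so only the timing, not the final buffer contents, is meaningful.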