Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

Improving for loop performance

Smart_Lubobya
Beginner
526 Views
How can I improve the performance of this loop? Is there a faster C++ equivalent?

.cpp file

void tom::add(void* ptr)
{
    short* b = (short*)ptr;
    int j;

    for (j = 0; j < 16; j += 4)
    {
        /// 1st stage add
        int c0 = (int)(b[j]   + b[j+3]);
        int c3 = (int)(b[j]   - b[j+3]);
        int c1 = (int)(b[j+1] + b[j+2]);
        int c2 = (int)(b[j+1] - b[j+2]);

        /// 2nd stage add.
        b[j]   = (short)(c0 + c1);
        b[j+2] = (short)(c0 - c1);
        b[j+1] = (short)(c2 + (c3 << 1));
        b[j+3] = (short)(c3 - (c2 << 1));
    }
}

9 Replies
jimdempseyatthecove
Honored Contributor III
Is your code correct?

In

int c? = (int)(b[?] op b[?]);

the result of b[?] op b[?] may exceed the range of short.

Does the code require

int c? = (int)b[?] op (int)b[?];

Jim Dempsey
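
On the overflow question, a small sketch may help. In standard C++ the operands of + and - undergo integer promotion, so two short operands are converted to int before the arithmetic is done, and both spellings of the first-stage add give the same int value. Only the cast back to short in the second stage can lose bits. The helper names stage1_sum and check_promotion are hypothetical, and the static_assert needs a C++11 compiler.

```cpp
#include <cassert>
#include <type_traits>

// short + short is evaluated in int because of integer promotion
// (this relies on int being able to hold every short value, which
// the language guarantees).
static_assert(std::is_same<decltype(short(1) + short(1)), int>::value,
              "short + short yields int");

// Hypothetical helper mirroring the first-stage add of the posted loop.
inline int stage1_sum(short a, short b)
{
    return a + b;   // same value as (int)a + (int)b
}

// Hypothetical self-check: the sum of two large shorts is not wrapped
// to the short range before it reaches the int result.
inline void check_promotion()
{
    assert(stage1_sum(32767, 32767) == 65534);
}
```

So the extra casts are harmless but not required for the first stage.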
Smart_Lubobya
Beginner
Yes, the first-stage values are declared as int. The code works, but it is slow. I guess the bottleneck is the loop, hence my question: is there a way to optimise this loop to make it faster than it is now?
Milind_Kulkarni__Int
New Contributor II
Are you using the auto-vectorization feature of the Intel compilers? The default is -O2, so I think you could get some vectorization messages.

Also, look in the compiler User Guide, in the section "Using Parallelism: Auto-Vectorization", to map the option to your processor type. It also lists things that prevent vectorization, e.g. mixed data types, which your loop uses (and which is unsupported). I am not sure how changing that would affect your application's needs.

To enable vectorization, you could also split the loop into separate 1st-stage and 2nd-stage loop bodies, i.e. fission the loop, to avoid dependencies and a large loop body. Since a plain split would alter the semantics and the results, do it by using a separate array c[16] instead of the scalars cn (c0, c1, etc.).

I am not sure about any algorithmic change that could increase performance.

Also, since the loop is short (low trip count), it may already be efficient, so you would really have to measure and experiment at run time with test data.

For vectorization to occur, you have to do away with the mixed data types.

Lastly, I am not an expert on vectorization; I just hoped to help you look past the algorithm and into the compiler features.

A vectorizable version of your program is below (with the mixture of shorts and ints removed):

[bash]void add(void* ptr)
{
    // NOTE: ptr must now point to 16 int values, not 16 shorts.
    int* b = (int*)ptr;
    int i, j, c[16];
#pragma vector always
    for (i = 0; i < 4; i++)
    {
        j = 4*i;
        /// 1st stage add
        c[j]   = (b[j]   + b[j+3]);
        c[j+3] = (b[j]   - b[j+3]);
        c[j+1] = (b[j+1] + b[j+2]);
        c[j+2] = (b[j+1] - b[j+2]);
    }
#pragma vector always
    for (i = 0; i < 4; i++)
    {
        j = 4*i;
        /// 2nd stage add.
        b[j]   = (c[j] + c[j+1]);
        b[j+2] = (c[j] - c[j+1]);
        b[j+1] = (c[j+2] + (c[j+3] << 1));
        b[j+3] = (c[j+3] - (c[j+2] << 1));
    }
}
[/bash]

Please let me know whether that does your program any justice.
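
One way to try the fission idea without changing the caller's data layout is to keep b as short* and use an int scratch array only for the first-stage results. The sketch below assumes ptr still points at 16 short values, as in the original post. The free-function names add_original, add_fissioned and check_same_result are hypothetical, and there is no guarantee that any particular compiler vectorizes either loop. Multiplication by 2 is used in place of << 1 so that negative intermediates stay well defined.

```cpp
#include <cassert>

// Reference: the single-loop form from the original post, as a free function.
void add_original(void* ptr)
{
    short* b = (short*)ptr;
    for (int j = 0; j < 16; j += 4)
    {
        int c0 = b[j]   + b[j+3];
        int c3 = b[j]   - b[j+3];
        int c1 = b[j+1] + b[j+2];
        int c2 = b[j+1] - b[j+2];
        b[j]   = (short)(c0 + c1);
        b[j+2] = (short)(c0 - c1);
        b[j+1] = (short)(c2 + c3 * 2);
        b[j+3] = (short)(c3 - c2 * 2);
    }
}

// Fissioned form: stage 1 fills an int scratch array, stage 2 writes back.
void add_fissioned(void* ptr)
{
    short* b = (short*)ptr;
    int c[16];
    for (int j = 0; j < 16; j += 4)
    {
        c[j]   = b[j]   + b[j+3];
        c[j+3] = b[j]   - b[j+3];
        c[j+1] = b[j+1] + b[j+2];
        c[j+2] = b[j+1] - b[j+2];
    }
    for (int j = 0; j < 16; j += 4)
    {
        b[j]   = (short)(c[j]   + c[j+1]);
        b[j+2] = (short)(c[j]   - c[j+1]);
        b[j+1] = (short)(c[j+2] + c[j+3] * 2);
        b[j+3] = (short)(c[j+3] - c[j+2] * 2);
    }
}

// Asserts that both forms agree on one 16-element block.
void check_same_result(const short (&in)[16])
{
    short x[16], y[16];
    for (int k = 0; k < 16; ++k) { x[k] = in[k]; y[k] = in[k]; }
    add_original(x);
    add_fissioned(y);
    for (int k = 0; k < 16; ++k)
        assert(x[k] == y[k]);
}
```

Running check_same_result on a few representative blocks before timing anything helps confirm that a rewrite has not changed the results.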

jimdempseyatthecove
Honored Contributor III
Try something like the following untested code

[bash]__int64* b64 = (__int64*)ptr;
for (int i = 0; i < 4; ++i)
{
    __int64 t0 = b64[i];
    // t0 = {b[i*4+0],b[i*4+1],b[i*4+2],b[i*4+3]}
    __int64 t1 = _lrotl(t0, 16) & 0x0000FFFF0000FFFFL;
    // t1 = {0x0000,b[i*4+2],0x0000,b[i*4+0]}
    t0 &= 0x0000FFFF0000FFFFL;
    // t0 = {0x0000,b[i*4+1],0x0000,b[i*4+3]}
    __int64 c10 = t0 + t1;
    // c10 = { b[i*4+1]+b[i*4+2], b[i*4+3]+b[i*4+0]}
    // c10 = {         c1,                c0 }
    __int64 c23 = t0 - t1;
    // c23 = { b[i*4+1]-b[i*4+2], b[i*4+3]-b[i*4+0]}
    // c23 = {       c2,                 -c3 }
    ((short*)&b64[i])[0] = (short)(_lrotl(c10, 32) + c10);
    //                                c0 + c1
    ((short*)&b64[i])[1] = (short)(_lrotl(c10, 32) - c10);
    //                                c0 - c1
    ((short*)&b64[i])[2] = (short)(c23 - _lrotl(c23, 33));
    //                              c2 + (c3<<1)
    ((short*)&b64[i])[3] = (short)(-_lrotl(c23, 32) - (c23 << 1));
    //                              c3 - (c2<<1)
}
[/bash]

Let us know the results.

Jim Dempsey
Smart_Lubobya
Beginner
I tried the code:
1) I got the warning 'warning C4068: unknown pragma' on #pragma vector always.
2) When I commented out the #pragma vector always, ptr could not be converted to int, i.e. int* b = (int*)ptr; did not show any array values during debugging. How do I proceed?

Note: I am using MS Visual Studio 2008.
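
The missing array values are what one would expect if the caller still passes 16 short values. Assuming 2-byte short and 4-byte int, casting that pointer to int* makes each int read cover two adjacent shorts, and 16 int reads run past the end of the 32-byte buffer. A minimal sketch of one workaround, assuming the data must stay short for the caller, is to widen into a temporary int array, run the int version, and narrow back. The names add_int and add_via_widening are hypothetical, and multiplication by 2 stands in for << 1 so that negative intermediates stay well defined.

```cpp
#include <cassert>

// int version of the two-stage add (hypothetical name); b must point to 16 ints.
void add_int(int* b)
{
    int c[16];
    for (int j = 0; j < 16; j += 4)
    {
        c[j]   = b[j]   + b[j+3];
        c[j+3] = b[j]   - b[j+3];
        c[j+1] = b[j+1] + b[j+2];
        c[j+2] = b[j+1] - b[j+2];
    }
    for (int j = 0; j < 16; j += 4)
    {
        b[j]   = c[j]   + c[j+1];
        b[j+2] = c[j]   - c[j+1];
        b[j+1] = c[j+2] + c[j+3] * 2;
        b[j+3] = c[j+3] - c[j+2] * 2;
    }
}

// Widen 16 shorts to ints, transform, then narrow back to short.
void add_via_widening(short* s)
{
    assert(s != 0);                       // caller must supply a valid buffer
    int tmp[16];
    for (int k = 0; k < 16; ++k) tmp[k] = s[k];
    add_int(tmp);
    for (int k = 0; k < 16; ++k) s[k] = (short)tmp[k];
}
```

The two copies cost time of their own, so for a 16-element block this may well be slower than the original loop; it is meant to show the data-layout issue rather than to be a speed-up.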
Smart_Lubobya
Beginner
After trying this code, I got 5 errors: identifier _lrotl could not be found, on all five lines where _lrotl is used.

I am using the MS Visual Studio 2008 compiler.
Milind_Kulkarni__Int
New Contributor II
I thought you were using the Intel compiler, which recognizes these pragmas. You seem to be using the VS2008 compiler. So, to avoid complicating things further, you can use the example provided by Jim.
jimdempseyatthecove
Honored Contributor III
Look in your MSVS 2008 documentation for the required headers.
In MSVC 2005, _lrotl requires stdlib.h.

Also see if MS changed it to a double underscore: __lrotl

Jim
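
If the intrinsic is unavailable, or if it rotates a narrower type than the code needs (where unsigned long is 32 bits, a rotate on unsigned long cannot rotate a 64-bit value), a plain C++ rotate can stand in for it. The sketch below assumes that what the untested code needs is a 64-bit left rotate. rotl64 and pack4 are hypothetical helper names, not library functions.

```cpp
#include <cassert>

// Hypothetical helper: rotate a 64-bit value left by n bits (n taken modulo 64).
inline unsigned long long rotl64(unsigned long long x, unsigned n)
{
    assert(sizeof(x) == 8);        // the formula below assumes exactly 64 bits
    n &= 63u;
    if (n == 0) return x;          // avoid an undefined shift by 64
    return (x << n) | (x >> (64u - n));
}

// Pack four 16-bit lanes, lane 0 in the lowest bits, to mirror reading
// four shorts through a 64-bit pointer on a little-endian machine.
inline unsigned long long pack4(unsigned short a0, unsigned short a1,
                                unsigned short a2, unsigned short a3)
{
    return  (unsigned long long)a0
         | ((unsigned long long)a1 << 16)
         | ((unsigned long long)a2 << 32)
         | ((unsigned long long)a3 << 48);
}
```

For example, rotl64(pack4(1, 2, 3, 4), 16) moves every lane up by one position and wraps the top lane around to the bottom, which gives pack4(4, 1, 2, 3).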
Om_S_Intel
Employee
The loop count is too small for vectorization to give an optimization benefit.