How can I improve the performance of this loop? Is there a better C++ equivalent?
.cpp file
void tom::add(void* ptr)
{
    short* b = (short*)ptr;
    int j;
    for (j = 0; j < 16; j += 4)
    {
        /// 1st stage add
        int c0 = (int)(b[j] + b[j+3]);
        int c3 = (int)(b[j] - b[j+3]);
        int c1 = (int)(b[j+1] + b[j+2]);
        int c2 = (int)(b[j+1] - b[j+2]);
        /// 2nd stage add.
        b[j] = (short)(c0 + c1);
        b[j+2] = (short)(c0 - c1);
        b[j+1] = (short)(c2 + (c3 << 1));
        b[j+3] = (short)(c3 - (c2 << 1));
    }
}
9 Replies
Is your code correct?
In the statement
int c? = (int)(b[?] op b[?]);
the expression b[?] op b[?] may exceed the limits of short.
Does the code require
int c? = (int)b[?] op (int)b[?];
Jim Dempsey
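For reference, a minimal sketch (not from the thread) of how the two forms compare. In C and C++ both short operands are promoted to int before the operator is applied, so on platforms where int is wider than short (assumed here) the two forms should give the same value. The helper names stage1_sum and stage1_sum_explicit are made up for illustration.

```cpp
#include <type_traits>

// Hypothetical helper: the form used in the original loop.
// a and b are promoted to int before the addition takes place.
inline int stage1_sum(short a, short b) {
    static_assert(std::is_same<decltype(a + b), int>::value,
                  "short + short is evaluated as int");
    return (int)(a + b);
}

// Hypothetical helper: the form with each operand cast explicitly.
inline int stage1_sum_explicit(short a, short b) {
    return (int)a + (int)b;
}
```

Under that assumption, the narrowing only happens in the second stage, where the results are cast back to short.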
Yes, the first-stage values are declared as int. The code works, but it is slow. I guess the bottleneck is the loop, hence my question: is there a way to optimize this loop to make it faster than it is now?
Are you using the auto-vectorization feature of the Intel compilers? The default is -O2, so I think you could get some vectorization messages.
Also, you can look in the compiler User Guide, in the section "Using Parallelism: Auto-Vectorization", to map the option to the processor type. There are also some guidelines there on what prevents vectorization, e.g. mixed data types, which your program uses in the loop (and which is unsupported). I am not sure how changing that would affect your application's needs.
To enable vectorization, you could also split the loop, putting the 1st stage and the 2nd stage in different loop bodies, to avoid dependencies and a large loop body, i.e. by fissioning the loop. Since naive fission would alter the semantics and the results, it can be done by using a separate array c[16] in place of the scalars cn (c0, ... etc.).
I am not sure about any algorithmic change that could be made to increase performance.
Also, since the loop body is short (low trip count), it may already be efficient, so you would really have to test the numbers and experiment at run time with test data.
For vectorization to occur, you have to do away with the mixed data types.
Lastly, I am not an expert on vectorization; I just thought I could help you, at least minimally, to look past the program's algorithm and into the compiler features.
Your vectorizable program looks like the code below (with the mixture of shorts and ints done away with).
[bash]
void add(void* ptr)
{
    int* b = (int*)ptr;   // the buffer is treated as 16 ints here (no short/int mix)
    int i, j, c[16];

#pragma vector always
    for (i = 0; i < 4; i++)
    {
        j = 4*i;
        /// 1st stage add
        c[j]   = (b[j]   + b[j+3]);
        c[j+3] = (b[j]   - b[j+3]);
        c[j+1] = (b[j+1] + b[j+2]);
        c[j+2] = (b[j+1] - b[j+2]);
    }

#pragma vector always
    for (i = 0; i < 4; i++)
    {
        j = 4*i;
        /// 2nd stage add.
        b[j]   = (c[j] + c[j+1]);
        b[j+2] = (c[j] - c[j+1]);
        b[j+1] = (c[j+2] + (c[j+3] << 1));
        b[j+3] = (c[j+3] - (c[j+2] << 1));
    }
}
[/bash]
Please let me know whether that does any justice to your program.
Try something like the following untested code
[bash]
__int64* b64 = (__int64*)ptr;
for(int i=0; i < 4; ++i)
{
    __int64 t0 = b64[i];                               // t0 = {b[i*4+0],b[i*4+1],b[i*4+2],b[i*4+3]}
    __int64 t1 = _lrotl(t0, 16) & 0x0000FFFF0000FFFFL; // t1 = {0x0000,b[i*4+2],0x0000,b[i*4+0]}
    t0 &= 0x0000FFFF0000FFFFL;                         // t0 = {0x0000,b[i*4+1],0x0000,b[i*4+3]}
    __int64 c10 = t0 + t1;                             // c10 = { b[i*4+1]+b[i*4+2], b[i*4+3]+b[i*4+0]}
                                                       // c10 = { c1, c0 }
    __int64 c23 = t0 - t1;                             // c23 = { b[i*4+1]-b[i*4+2], b[i*4+3]-b[i*4+0]}
                                                       // c23 = { c2, -c3 }
    ((short*)&b64[i])[0] = (short)(_lrotl(c10,32) + c10);   // c0 + c1
    ((short*)&b64[i])[1] = (short)(_lrotl(c10,32) - c10);   // c0 - c1
    ((short*)&b64[i])[2] = (short)(c23 - _lrotl(c23, 33));  // c2 + (c3<<1)
    ((short*)&b64[i])[3] = (short)(-_lrotl(c23,32) - (c23<<1));
}
[/bash]
Let us know the results.
Jim Dempsey
I tried the code:
1) I got the warning 'warning C4068: unknown pragma' on #pragma vector always.
2) When I commented out the #pragma vector always lines, ptr could not be converted to int, i.e. int* b = (int*)ptr; did not show any array values during debugging. How do I proceed?
Note: I am using MS Visual Studio 2008.
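One possible reading of the second problem is that the buffer behind ptr still holds 16 short values; in that case reading it through an int* reinterprets the memory and also reads past the end of the data. Assuming that is what happened, here is a minimal sketch (not from the thread) that keeps the short buffer and does the two-pass split with an int work array instead of a pointer cast. The name add_two_pass is made up for illustration, and 2 * x is written in place of x << 1 so that negative values stay well defined.

```cpp
#include <cassert>

// Hypothetical variant: b points to the caller's 16 shorts, unchanged in type.
void add_two_pass(short* b) {
    assert(b != 0);
    int c[16];  // int work array holding the first-stage results

    // 1st stage: read shorts (promoted to int), store int results.
    for (int j = 0; j < 16; j += 4) {
        c[j]     = b[j]     + b[j + 3];
        c[j + 3] = b[j]     - b[j + 3];
        c[j + 1] = b[j + 1] + b[j + 2];
        c[j + 2] = b[j + 1] - b[j + 2];
    }

    // 2nd stage: combine and narrow back to short, as in the original loop.
    for (int j = 0; j < 16; j += 4) {
        b[j]     = (short)(c[j] + c[j + 1]);
        b[j + 2] = (short)(c[j] - c[j + 1]);
        b[j + 1] = (short)(c[j + 2] + 2 * c[j + 3]);
        b[j + 3] = (short)(c[j + 3] - 2 * c[j + 2]);
    }
}
```

This contains no compiler-specific pragma, so it should build with the Microsoft compiler as well. Whether it is any faster than the single loop would have to be measured.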
After trying this code, I got 5 errors: identifier _lrotl could not be found,
on all five lines where _lrotl is used.
I am using the MSVC 2008 compiler.
I thought you were using the Intel compiler, which recognizes these pragmas. You seem to be using the VS2008 compiler. So, to avoid complicating things further, you can use the example provided by Jim.
Look in your MSVS 2008 documentation for the required headers.
In MSVC 2005, _lrotl requires stdlib.h.
Also see if MS changed it to a double underscore, __lrotl.
Jim
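One more point worth checking alongside the header: as far as I know, MSVC's _lrotl rotates an unsigned long, which is 32 bits wide with that compiler, and _rotl64 is the 64-bit counterpart declared in the same header. If a compiler-specific call is not wanted, a 64-bit rotate can also be written in plain C++. The sketch below is not from the thread; the names u64 and rotl64 are made up for illustration, and it assumes unsigned long long is exactly 64 bits wide.

```cpp
#include <cassert>

typedef unsigned long long u64;  // assumed to be exactly 64 bits

// Hypothetical helper: rotate x left by n bits using only standard C++.
inline u64 rotl64(u64 x, unsigned n) {
    assert(sizeof(u64) * 8 == 64);  // documents the width assumption
    n &= 63u;                       // keep the shift count in 0..63
    if (n == 0) return x;           // avoids an undefined shift by 64 below
    return (x << n) | (x >> (64u - n));
}
```

For example, rotl64(t0, 16) would take the place of _lrotl(t0, 16) when t0 is a 64-bit value.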
The loop count is too small for vectorization to give an optimization benefit.
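Since the gain is uncertain for such a short loop, it may be worth measuring before changing anything. Below is a rough timing sketch, not from the thread, assuming a C++11 compiler for <chrono> (so newer than VS2008). add_reference repeats the arithmetic of the loop in the first post as a free function, with 2 * x written in place of x << 1; average_ns_per_call is a made-up helper name.

```cpp
#include <cassert>
#include <chrono>

// Same arithmetic as the loop in the first post, written as a free function.
void add_reference(short* b) {
    for (int j = 0; j < 16; j += 4) {
        int c0 = b[j]     + b[j + 3];
        int c3 = b[j]     - b[j + 3];
        int c1 = b[j + 1] + b[j + 2];
        int c2 = b[j + 1] - b[j + 2];
        b[j]     = (short)(c0 + c1);
        b[j + 2] = (short)(c0 - c1);
        b[j + 1] = (short)(c2 + 2 * c3);
        b[j + 3] = (short)(c3 - 2 * c2);
    }
}

// Hypothetical helper: average wall-clock time per call, in nanoseconds.
// Refilling the 16-element block is included in the measurement, so the
// number is only useful for comparing candidates against each other.
double average_ns_per_call(void (*fn)(short*), int repeats) {
    assert(repeats > 0);
    short block[16];
    volatile short sink = 0;  // keeps the compiler from discarding the work
    const auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < repeats; ++r) {
        for (int k = 0; k < 16; ++k) block[k] = (short)(k + (r & 7));
        fn(block);
        sink = block[0];
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::nano>(stop - start).count() / repeats;
}
```

A call such as average_ns_per_call(add_reference, 1000000) gives a baseline; any candidate with the same signature can be passed in the same way and compared on the same machine and build settings.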
