FortCpp
Beginner
164 Views

Improve the performance of a sum

Hello Intel,

I am working with a scientific computation code, and I want to improve it a little if possible. I checked the code with Amplifier. The most time-consuming (heavily used) code is this:

[cpp]

double a = 0.0;
for(j = 0; j < n; j++) a += w[j]*fi[((index[j] + i) << ldf) + k];

[/cpp]

To me it is just a dot product between w and fi. I am wondering:

1. Will the Intel compiler do this automatically? (I mean, treat the loop as a dot product of two vectorized arrays.)

2. Is there a way to improve the code? (I mean, maybe define another array a1 of the same size as w, store all the products in a1 (unrolled loop?), and do the summation at the end.)

3. Other suggestions?

I am using Parallel Composer 2013 with Visual Studio. Any idea will be appreciated! :)

68 Replies
TimP
Black Belt

SSE4.1 or later would support such optimization better than the earlier instruction sets. You would want to assure that the variables which presumably are loop invariant have local definitions, so they are not subject to aliasing. Of course, you would assure that your code is standard compliant so you can set -ansi-alias. Did you look up the icc pragmas, e.g. #pragma simd reduction(+: a)? If the loop is big enough, you may wish to add threaded reduction on top of the vector reduction, using OpenMP or Cilk Plus. What do -opt-report and the guide options say?
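[Editor's sketch of the reduction pragma mentioned above, applied to the original loop. The function name and test data are made up; #pragma simd is icc-specific, and other compilers simply ignore an unknown pragma.]

```cpp
#include <cassert>

// Dot-product-style reduction over indirectly addressed data.
// #pragma simd reduction(+: a) tells icc it may vectorize the sum
// despite the loop-carried dependency on the accumulator 'a'.
double indexed_dot(const double* w, const double* fi,
                   const int* index, int i, int ldf, int k, int n) {
    double a = 0.0;
#pragma simd reduction(+: a)
    for (int j = 0; j < n; j++)
        a += w[j] * fi[((index[j] + i) << ldf) + k];
    return a;
}
```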
Royi
Novice

Could you elaborate on "You would want to assure that the variables which presumably are loop invariant have local definition so are not subject to aliasing"? Thank you.
TimP
Black Belt

For example, for(int j = 0; ... would give j local scope and keep the compiler from being concerned about side effects on it. Likewise, i, k and ldf should ideally have local scope, as a does. If you can't use restrict qualifiers on w[] and index[], as you mentioned, you would require #pragma ivdep. #pragma simd reduction would ignore all such dependencies, obviating the need for #pragma ivdep, but it also ignores proven dependencies, so you should make an effort to get the loop optimized, or obtain compiler diagnostics, without simd first.
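[Editor's sketch of the local-scope and restrict advice above. The function signature is hypothetical; __restrict is the spelling accepted by icc, MSVC, GCC and Clang.]

```cpp
#include <cassert>

// A local loop counter plus __restrict on the pointer parameters lets
// the compiler assume w, fi and index do not alias each other, so it
// need not re-load them or worry about side effects on j, i, ldf, k.
double indexed_dot(const double* __restrict w,
                   const double* __restrict fi,
                   const int*    __restrict index,
                   int i, int ldf, int k, int n) {
    double a = 0.0;
    for (int j = 0; j < n; j++)   // j has local scope
        a += w[j] * fi[((index[j] + i) << ldf) + k];
    return a;
}
```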
FortCpp
Beginner

Thanks for your comment, TimP. SSE4.1 did improve the performance, but still not very significantly. Your post raises a lot of OpenMP-related issues. I didn't know OpenMP, but I think I get the general idea now. I am just curious about the serial optimization of the loop, i.e. how to speed up the case where the numbers are not next to each other in memory. I also asked the same question on Stack Overflow, and it seems that rearranging the data is the only choice. I know that I can get some performance improvement by using more than one core, but I want to improve the loop if possible before I move on to OpenMP or MPI. And I do think that even if I use OpenMP, the inefficient cache usage will still be the problem. Does OpenMP provide a way for multiple threads to handle such a non-contiguous data set? Could you please tell me some more about your idea?
TimP
Black Belt

Indirect access such as you have quoted certainly raises issues of cache locality which can't be discussed without more knowledge of your application. The point of OpenMP is to parallelize your application across multiple threads. Taking several threads together, there is typically more aggregate cache resource than when running a single thread, so it should help in handling larger data sets efficiently. With the indirect access, it's not necessarily a fait accompli. OpenMP is more often suited to nested loop situations where the inner loop is vectorized and the outer one partitioned (automatically) among threads. With MPI, you partition your data set so that each CPU has access only to a part of it. It may work well as an outer level of parallelization on top of OpenMP.
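[Editor's sketch of the threaded reduction described above, using OpenMP. The function name is made up; without an OpenMP-enabled build the pragma is ignored and the loop runs serially with the same result.]

```cpp
#include <cassert>

// OpenMP version of the reduction: the iteration space is partitioned
// among threads, each thread keeps a private partial sum, and the
// partial sums are combined at the end of the parallel region.
double indexed_dot_omp(const double* w, const double* fi,
                       const int* index, int i, int ldf, int k, int n) {
    double a = 0.0;
#pragma omp parallel for reduction(+: a)
    for (int j = 0; j < n; j++)
        a += w[j] * fi[((index[j] + i) << ldf) + k];
    return a;
}
```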
FortCpp
Beginner

OK, I'll try OpenMP. But I still don't understand where the benefit of using more than one thread comes from. The code I am working with is a huge calculation in general; the data is several GB in memory, and the loop may operate on big vectors. For example, think about a 3D matrix with index-x, index-y and index-z, where the elements are ordered in memory with x fastest, y medium and z slowest (x1y1z1 -> x2y1z1 -> x3y1z1 -> ... -> x1y2z1 -> x2y2z1 -> ...). When I want to take a slice of it (for fixed x, y) and multiply it with a given vector, the only thing the CPU can do is load a portion of the matrix, select the piece that is needed, and apply the multiplication. There is data that is not useful at that moment and has to be skipped over. So how can multiple threads on a single processor speed this up? Or are there other ways to optimize?
SergeyKostrov
Valued Contributor II

>>...SSE 4.1 did improve the performance. But still not very significant...

Did you try to unroll the loop? I have a similar piece of code ( not exactly the same, of course ) and I remember that a 4-in-1 unroll improved the performance of the summation. I'll provide some numbers if you are interested, and try to do some testing with your code.

>>...the data is around some GB in the memory...

How much memory do you have on your system? Is it a 32-bit or 64-bit platform? I wouldn't expect a significant improvement in performance if some parts of your data are paged out to the Virtual Memory file during processing due to a lack of available Physical Memory.
SergeyKostrov
Valued Contributor II

>>...for(j = 0; j < n; j++) a += w*fi[((index + i) << ldf) + k];

Could you provide some technical details for the i, ldf and k variables? Could I use any values for test purposes? Thanks in advance.

Note: I wonder if the editor problem with the left-angle-bracket character could be fixed?
SergeyKostrov
Valued Contributor II

>>...Did you try to unroll the loop? I have a similar piece of code ( not exactly, of course ) and I remember that unrolling 4-in-1 improved performance of
>>summation. I'll provide some numbers...

Here are the results ( addition of two matrices / sizes 2048x2048 / single-precision data type 'float' ):

- a 4-in-1 unroll improves performance by ~3%
- an 8-in-1 unroll improves performance by ~4%

I always use hard-coded for-loop unrolling because I can't depend on C++ compilers' pragma directives, due to portability issues.
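[Editor's sketch of the hard-coded 4-in-1 unroll described above, applied to the two-matrix addition. The function name is made up; a remainder loop handles sizes that are not a multiple of 4.]

```cpp
#include <cassert>

// Hard-coded 4-in-1 unroll of c[] = a[] + b[]: four element
// operations per trip cut the loop-control overhead by a factor
// of four; the trailing loop cleans up the remainder.
void add_unrolled4(const float* a, const float* b, float* c, int n) {
    int i = 0;
    for (; i <= n - 4; i += 4) {
        c[i]     = a[i]     + b[i];
        c[i + 1] = a[i + 1] + b[i + 1];
        c[i + 2] = a[i + 2] + b[i + 2];
        c[i + 3] = a[i + 3] + b[i + 3];
    }
    for (; i < n; i++)   // remainder
        c[i] = a[i] + b[i];
}
```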
SergeyKostrov
Valued Contributor II

>>...Or there are other optimization ways?

Here is a list of different techniques I would try:

- if accuracy is not a crucial issue, use the single-precision data type 'float' instead of the double-precision data type 'double'
- the distribution of values in the 'index' array needs to be analyzed ( it creates random access to the 'fi' array )
- lock the 'index' array in memory using the 'VirtualLock' Win32 API function
- a software prefetch ( as an inline assembler instruction ) could improve performance by 0.5 - 1.0%
- raising the priority of the process / thread to Above Normal or High could improve performance by 0.5 - 1.0% ( do not change the priority to Real Time )
- warm up your data by pre-scanning the arrays before processing ( it will load as much data as possible into memory )

Let me know if you have any questions.
jimdempseyatthecove
Black Belt

The following may work better on systems supporting gather. On systems without gather it may make better use of compiler temporaries.

[cpp]

double a = 0.0;
int* fi_index = (int*)_alloca(n*sizeof(int));

// construct indexes
for(j = 0; j < n; j++) fi_index[j] = ((index[j] + i) << ldf) + k;

// use indexes
for(j = 0; j < n; j++) a += w[j]*fi[fi_index[j]];

[/cpp]

*** Please fix this forum such that one can copy and paste sample code without having to figure out html
*** Or take our time to edit and post and reedit and repost and reedit and post, ....
*** Your time taken once will save our time over and over and over again.

Jim Dempsey
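[Editor's self-contained sketch of the two-pass idea in this post; the function name and test data are made up, and a std::vector stands in for _alloca so the sketch is portable.]

```cpp
#include <cassert>
#include <vector>

// Pass 1 materializes the gather indexes; pass 2 sums through them.
// On hardware with a gather instruction the second loop can map to it
// directly; elsewhere it still separates the integer address
// arithmetic from the floating-point reduction.
double indexed_dot_two_pass(const double* w, const double* fi,
                            const int* index, int i, int ldf,
                            int k, int n) {
    std::vector<int> fi_index(n);
    for (int j = 0; j < n; j++)        // construct indexes
        fi_index[j] = ((index[j] + i) << ldf) + k;
    double a = 0.0;
    for (int j = 0; j < n; j++)        // use indexes
        a += w[j] * fi[fi_index[j]];
    return a;
}
```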
SergeyKostrov
Valued Contributor II

>>*** Please fix this forum such that one can copy and paste sample code without having to figure out html
>>*** Or take our time to edit and post and reedit and repost and reedit and post, ....
>>*** Your time taken once will save our time over and over and over again.

Agree with Jim. On the new IDZ forum the 'Preview' button for a post was removed for some unknown reason. Why did you do that? I always used the 'Preview' button before submitting a new post, and it allowed me to fix little issues in the text or source code.
FortCpp
Beginner

Sergey Kostrov wrote:

>>...SSE 4.1 did improve the performance. But still not very significant...
.
Did you try to unroll the loop? I have a similar piece of code ( not exactly, of course ) and I remember that unrolling 4-in-1 improved performance of summation. I'll provide some numbers if interested and try to do some testing with your codes.
.
>>...the data is around some GB in the memory...
.
How much memory do you have on your system? Is it 32-bit or 64-bit platform?
.
I wouldn't expect a significant improvement in performance if some parts of your data are saved in a Virtual Memory file during the processing due to lack of available Physical Memory.

I did: /Qunroll:100000000. But I don't think the unrolling would contribute. You see, each step of the sum is "a += ...", so there is a dependency: step n+1 depends on the previous step. If a were an array you could unroll the loop, but here a is just a double variable. That's my understanding of unrolling. I added /Qunroll:100000000, but maybe it doesn't take any effect. I have 4 GB free on my system ( Windows 7 64-bit, 8 GB total, and I am using the 64-bit configuration ).
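[Editor's note: a sum can still be unrolled by keeping several partial accumulators, which is presumably what the 4-in-1 unroll above refers to. A sketch, with a made-up function name:]

```cpp
#include <cassert>

// 4-in-1 unrolled sum with four independent partial accumulators.
// Each accumulator carries its own dependency chain, so the four
// chains can overlap in the pipeline instead of serializing on one
// 'a'. The partial sums are combined after the loop.
double sum_unrolled4(const double* x, int n) {
    double a0 = 0.0, a1 = 0.0, a2 = 0.0, a3 = 0.0;
    int j = 0;
    for (; j <= n - 4; j += 4) {
        a0 += x[j];
        a1 += x[j + 1];
        a2 += x[j + 2];
        a3 += x[j + 3];
    }
    double a = (a0 + a1) + (a2 + a3);
    for (; j < n; j++)   // remainder
        a += x[j];
    return a;
}
```

Note that this changes the order of the floating-point additions, so the result can differ from the serial sum in the last bits.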
FortCpp
Beginner

Sergey Kostrov wrote:

>>...for(j = 0; j < n; j++) a += w*fi[((index + i) << ldf) + k];
.
Could you provide some technical details for i, ldf and k variables? Could I use any values for test purposes? Thanks in advance.
.
Note: I wonder if the editor problem with a left-arrow character could be fixed?

Sorry, it is a bit complicated, and I don't have a simple example for you to try out. Here is my understanding: overall this is part of an eigenvalue calculation using the conjugate gradient method. The code above is probably part of preparing the matrix to solve. The matrix in the test calculation, for example, is a million by million sparse matrix, and I believe it has 3 million non-zero elements ( some sort of tri-diagonal structure ). But in actual usage the size may go to 10 million or even bigger. I started to focus on this loop because I checked the code with Amplifier. Looking at the result, the 2 most time-consuming functions are MKL functions ( I used the Intel version: a dot product, and another one I don't know very well ). That loop is the most significant time consumer other than those.
FortCpp
Beginner

Sergey Kostrov wrote:

>>...Or there are other optimization ways?
.
Here is a list of different techniques I would try:
.
- if accuracy is not a crucial issue use a single-precision data type 'float' instead of double-precision data type 'double'
- a distribution of values in 'index' array needs to be analyzed ( it creates a random access to array 'fi' )
- lock in memory 'index' array using a 'VirtualLock' Win32 API function
- software prefetch ( as inline assembler instruction ) could improve performance by 0.5 - 1.0%
- raise a priority of the process / thread to Above Normal or High could improve performance by 0.5 - 1.0% ( do not change priority to Real Time )
- warm up your data by pre-scanning arrays before processing ( it will load as much as possible data into memory )
.
Let me know if you have any questions.

Using float actually works for shrinking the time, but for the time being I'd like to keep the precision. All the matrices and vectors are already in memory, so there is no loading issue; I believe the bottleneck now is the cache. I never did raise the priority, and I should try it. Could you explain more about: 1. "lock in memory 'index' array using a 'VirtualLock' Win32 API function" 2. "software prefetch ( as inline assembler instruction ) could improve performance by 0.5 - 1.0%" Thanks a lot for your nice response!
FortCpp
Beginner

jimdempseyatthecove wrote:

The following may work better on systems supporting gather.
On systems without gather it may make better use of compiler temporaries.
double a = 0.0;

int* fi_index = (int*)_alloca(n*sizeof(int));

// construct indexes

for(j = 0; j < n; j++) fi_index[j] = ((index[j] + i) << ldf) + k;

// use indexes

for(j = 0; j < n; j++) a += w[j]*fi[fi_index[j]];

*** Please fix this forum such that one can copy and paste sample code without having to figure out html
*** Or take our time to edit and post and reedit and repost and reedit and post, ....
*** Your time taken once will save our time over and over and over again.

Jim Dempsey

Could you explain more about "systems supporting gather"? Does Windows 7 64-bit support it? What is "gather"? In your code the indexes are computed before the vector product. What is the reason for doing so? Thanks in advance!
SergeyKostrov
Valued Contributor II

Hi Jim,

>>*** Please fix this forum such that one can copy and paste sample code without having to figure out html

I think the IDZ developers are trying to force us C/C++ software developers to do more HTML programming. It looks like they follow Renee James' keynote statements about HTML5.

Best regards, Sergey
SergeyKostrov
Valued Contributor II

>>...I did. /Qunroll:100000000

Here is a summary from the Intel C++ compiler documentation regarding loop unrolling:

Loop Unrolling

The benefits of loop unrolling are as follows:
- Unrolling eliminates branches and some of the code.
- Unrolling enables you to aggressively schedule (or pipeline) the loop to hide latencies if you have enough free registers to keep variables live.
- For processors based on the IA-32 architectures, the processor can correctly predict the exit branch for an inner loop that has 16 or fewer iterations, if that number of iterations is predictable and there are no conditional branches in the loop. Therefore, if the loop body size is not excessive, and the probable number of iterations is known, unroll inner loops for these processors until they have a maximum of 16 iterations.
- A potential limitation is that excessive unrolling, or unrolling of very large loops, can lead to increased code size.

There is an option /Qunroll-aggressive, and here is some summary:

... On systems using IA-64 architecture, you can only specify a value of 0. ...
and
... This option determines whether the compiler uses more aggressive unrolling for certain loops. The positive form of the option may improve performance. On IA-32 architecture and Intel® 64 architecture, this option enables aggressive, complete unrolling for loops with small constant trip counts. On IA-64 architecture, this option enables additional complete unrolling for loops that have multiple exits, or outer loops that have a small constant trip count. ...

There is also an option -funroll-all-loops, and by default it is in the OFF state.

[SergeyK Note] There is only a very small performance improvement in going from 4-in-1 to 8-in-1 unrolling, and I always use 4-in-1. Regarding your concerns about unrolling: I would try it before making a final statement.
SergeyKostrov
Valued Contributor II

>>Could you explain more:
>>1. "lock in memory 'index' array using a 'VirtualLock' Win32 API function"

Here is a quote from MSDN:
... The VirtualLock function locks the specified region of the process's virtual address space into physical memory, ensuring that subsequent access to the region will not incur a page fault. ...

I also recommend you look at the Intel Threading Building Blocks forum thread 'Working with Memory Mapped Files?'. Web-link: http://software.intel.com/en-us/forums/topic/276995

Please take a look.
SergeyKostrov
Valued Contributor II

>>...2. "software prefetch ( as inline assembler instruction ) could improve performance by 0.5 - 1.0%"

I use the inline assembler instruction 'prefetcht0' instead of a call to the intrinsic function '_mm_prefetch' ( a more portable form, with some additional call overhead ). In general it looks like:

[cpp]
...
_asm prefetcht0 [ ptAddress ]
...
[/cpp]
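[Editor's sketch of the intrinsic form of the same prefetch, applied to the indexed sum from this thread. The function name and the prefetch distance of 8 iterations are made up; the distance would need tuning on real hardware. Requires an x86 target for <xmmintrin.h>.]

```cpp
#include <cassert>
#include <xmmintrin.h>  // _mm_prefetch, _MM_HINT_T0

// Prefetch the cache line holding fi[fi_index[j + 8]] a few
// iterations ahead of its use, hiding part of the memory latency
// of the random access through fi_index.
double sum_with_prefetch(const double* w, const double* fi,
                         const int* fi_index, int n) {
    double a = 0.0;
    for (int j = 0; j < n; j++) {
        if (j + 8 < n)
            _mm_prefetch((const char*)&fi[fi_index[j + 8]],
                         _MM_HINT_T0);
        a += w[j] * fi[fi_index[j]];
    }
    return a;
}
```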