I am using a scientific calculation code, and I want to improve it a little bit if possible. I checked the code with Amplifier. The most time-consuming (heavily used) code is this:
double a = 0.0;
for(j = 0; j < n; j++) a += w[j]*fi[((index[j] + i) << ldf) + k];
To me it is just a dot product between w and fi. I am wondering:
1. Will the Intel compiler do it automatically? (I mean, treat the loop as a dot product of two vectorized arrays.)
2. Is there a way to improve the code? (I mean, maybe define another array a1 the same size as w, store all the multiplied numbers in a1 (unrolled loop?), and do the summation at the end.)
3. Other suggestions?
I am using Parallel Composer 2013 with Visual Studio. Any idea will be appreciated! :)
Sergey Kostrov wrote:I did: /Qunroll:100000000. But I don't think the unrolling would contribute. You see, each time the sum is "a += ...". There is a dependency here, meaning step n+1 depends on the previous step. If a were an array you could unroll the loop, but here a is just a double variable. That's my understanding of unrolling. I added /Qunroll:100000000, but maybe it doesn't take any effect. I have 4 GB free in my system (Windows 7 64-bit, 8 GB total, and I am using the 64-bit configuration).
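For what it is worth, the dependency on `a` does not prevent unrolling: it can be broken by accumulating into several independent partial sums and combining them at the end. A minimal sketch in plain C (function and variable names are illustrative, not the actual code from the thread):

```c
#include <assert.h>
#include <stddef.h>

/* Dot product with four independent accumulators.
   The four partial sums have no dependency on each other, so their
   additions can overlap in the CPU pipeline; they are combined once
   after the loop. */
static double dot4(const double *w, const double *f, size_t n)
{
    double a0 = 0.0, a1 = 0.0, a2 = 0.0, a3 = 0.0;
    size_t j = 0;
    for (; j + 4 <= n; j += 4) {
        a0 += w[j]     * f[j];
        a1 += w[j + 1] * f[j + 1];
        a2 += w[j + 2] * f[j + 2];
        a3 += w[j + 3] * f[j + 3];
    }
    for (; j < n; j++)   /* remainder iterations */
        a0 += w[j] * f[j];
    return (a0 + a1) + (a2 + a3);
}
```

Note that reassociating the sum this way can change the last bits of the result, which is also why the compiler will not do it on its own unless fast floating-point math is enabled (e.g. /fp:fast).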
>>...SSE 4.1 did improve the performance. But still not very significant...
Did you try to unroll the loop? I have a similar piece of code ( not exactly, of course ) and I remember that unrolling 4-in-1 improved performance of summation. I'll provide some numbers if interested and try to do some testing with your codes.
>>...the data is around some GB in the memory...
How much memory do you have on your system? Is it 32-bit or 64-bit platform?
I wouldn't expect a significant improvement in performance if some parts of your data are saved in a Virtual Memory file during the processing due to lack of available Physical Memory.
Sergey Kostrov wrote:Sorry, it is a bit complicated. I don't have simple examples for you to try out. Here is my understanding: overall this is a piece of code calculating an eigenvalue problem using the conjugate gradient method. The code above is probably part of preparing the matrix to solve. The matrix in the test calculation, for example, is a million-by-million sparse matrix, and I believe there are about 3 million non-zero elements (some sort of tri-diagonal). But in actual usage the size may go to 10 million or even bigger. I started to focus on it because I checked it with Amplifier. Looking at the result, the two most time-consuming functions are MKL functions (the Intel version I use): a dot product and another one I don't know very well. That loop is the most significant time consumer other than those.
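For context, the kernel a conjugate-gradient solver of this kind spends most of its time in is a sparse matrix-vector product with exactly the same indexed-gather access pattern as the loop in question. A generic sketch using CSR storage (names are illustrative, not the poster's actual code or the MKL routines):

```c
#include <assert.h>
#include <stddef.h>

/* y = A*x for a sparse matrix A in CSR format:
   row_ptr[r] .. row_ptr[r+1] delimits the nonzeros of row r,
   col[] holds their column indices, val[] their values. */
static void spmv_csr(size_t nrows,
                     const size_t *row_ptr, const int *col,
                     const double *val, const double *x, double *y)
{
    for (size_t r = 0; r < nrows; r++) {
        double sum = 0.0;
        for (size_t p = row_ptr[r]; p < row_ptr[r + 1]; p++)
            sum += val[p] * x[col[p]];  /* indexed load, like fi[index[j]...] */
        y[r] = sum;
    }
}
```

For a tri-diagonal million-by-million matrix this gives roughly 3 million nonzeros, which matches the numbers above; the indexed load `x[col[p]]` is the cache-unfriendly part.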
>>...for(j = 0; j < n; j++) a += w[j]*fi[((index[j] + i) << ldf) + k];
Could you provide some technical details for the i, ldf and k variables? Could I use any values for test purposes? Thanks in advance.
Note: I wonder if the editor problem with the left-angle-bracket character could be fixed?
Sergey Kostrov wrote:Using float actually works for shrinking the time, but for the time being I'd like to keep the precision. All the matrices and vectors are already in memory, so there is no loading issue. I believe the bottleneck now is the cache. I never raised the priority, and I should try it. Could you explain more about: 1. "lock in memory 'index' array using a 'VirtualLock' Win32 API function"; 2. "software prefetch ( as inline assembler instruction ) could improve performance by 0.5 - 1.0%"? Thanks a lot for your nice response!
>>...Or there are other optimization ways?
Here is a list of different techniques I would try:
- if accuracy is not a crucial issue use a single-precision data type 'float' instead of double-precision data type 'double'
- the distribution of values in the 'index' array needs to be analyzed ( it creates a random access to array 'fi' )
- lock the 'index' array in memory using the 'VirtualLock' Win32 API function
- a software prefetch ( as an inline assembler instruction ) could improve performance by 0.5 - 1.0%
- raising the priority of the process / thread to Above Normal or High could improve performance by 0.5 - 1.0% ( do not change the priority to Real Time )
- warm up your data by pre-scanning arrays before processing ( it will load as much as possible data into memory )
Let me know if you have any questions.
jimdempseyatthecove wrote:Could you explain more about "systems supporting gather"? Does Windows 7 64-bit support it? What is "gather"? In the code the index is computed before the vector product; what's the reason for doing so? Thanks in advance!
The following may work better on systems supporting gather.
On systems without gather it may make better use of compiler temporaries.
double a = 0.0;
int* fi_index = (int*)_alloca(n*sizeof(int));
// construct indexes
for(j = 0; j < n; j++) fi_index[j] = ((index[j] + i) << ldf) + k;
// use indexes
for(j = 0; j < n; j++) a += w[j]*fi[fi_index[j]];
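As for what "gather" is: it is a vector instruction (the AVX2 vgatherdpd family) that loads several non-contiguous array elements in one operation, driven by a vector of indices. It is a CPU instruction-set feature (AVX2, Haswell and later), not an operating-system one, so Windows 7 64-bit runs it as long as the processor supports AVX2. A scalar sketch of its semantics in plain C, with no intrinsics (the function name is illustrative):

```c
#include <assert.h>

/* Semantics of a 4-lane double-precision gather: out[l] = base[idx[l]].
   On AVX2 hardware this is a single vgatherdpd instruction operating
   on a vector of four indices; here it is spelled out lane by lane. */
static void gather4_pd(double *out, const double *base, const int *idx)
{
    for (int lane = 0; lane < 4; lane++)
        out[lane] = base[idx[lane]];
}
```

Precomputing fi_index as in the code above puts the indices in one contiguous array, which is exactly the form a gather instruction (or the compiler's vectorizer) wants to consume.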
*** Please fix this forum so that one can copy and paste sample code without having to fight the HTML,
*** or we have to take our time to edit and post, and re-edit and re-post, and re-edit and post, ...
*** Your time taken once will save our time over and over and over again.