Improve the performance of a sum - Page 4

FortCpp · ‎10-02-2012

Hello Intel,

I am working with a scientific computing code and would like to speed it up a little if possible. I profiled it with Amplifier, and the most time-consuming (most heavily used) code is this:

[cpp]

double a = 0.0;
for(j = 0; j < n; j++) a += w*fi[((index + i)<<ldf) + k];

[/cpp]

To me it is just a dot product between w and fi. I am wondering:

1. Will the Intel compiler do this automatically? (I mean, will it treat the loop as the dot product of two vectorized arrays?)

2. Is there a way to improve the code? (For example, define another array a1 of the same size as w, store every product in a1 (unrolled loop?), and do the summation at the end.)

3. Other suggestions?

I am using Parallel Composer 2013 with Visual Studio. Any ideas will be appreciated! :)
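The idea in question 2 (keep the products apart and sum them at the end) is often written with a few independent partial sums rather than a full-size temporary array. The following is only a minimal sketch under stated assumptions: `dot4`, `x` and `y` are hypothetical names, both inputs are assumed to be plain contiguous arrays of `double`, and the shifted indexing into `fi` from the loop above is not reproduced.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not from the original code): dot product of two
// contiguous arrays using four independent partial sums.
double dot4(const double* x, const double* y, std::size_t n)
{
    assert((x != nullptr && y != nullptr) || n == 0);

    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;

    std::size_t j = 0;
    for (; j + 4 <= n; j += 4) {
        // The four accumulators do not depend on each other,
        // so the additions can overlap in the pipeline.
        s0 += x[j]     * y[j];
        s1 += x[j + 1] * y[j + 1];
        s2 += x[j + 2] * y[j + 2];
        s3 += x[j + 3] * y[j + 3];
    }
    for (; j < n; ++j) {
        // Leftover elements when n is not a multiple of 4.
        s0 += x[j] * y[j];
    }

    // Combine the partial sums once, at the end.
    return (s0 + s1) + (s2 + s3);
}
```

Splitting the sum this way changes the order of the floating-point additions, so the result can differ in the last bits from the simple loop. Whether a compiler applies a similar transformation on its own depends on its floating-point model settings.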

SergeyKostrov · ‎10-22-2012

>>I always recommend to read Intel's article related to that subject:
>>
>>'Consistency of Floating-Point Results using the Intel(R) Compiler' By Dr. Martyn J. Corden and David Kreitzer
>>
>>I could upload if you won't be able to find the article.

The article is attached.

SergeyKostrov · ‎10-22-2012

Here are a couple of very useful web links to look at:

http://en.wikipedia.org/wiki/Single_precision
http://www.binaryconvert.com
http://www.binaryconvert.com/convert_float.html

[COMMENTED] There is a strange re-formatting issue.

SergeyKostrov · ‎10-23-2012

>>...That would be very nice! Can you send me your source? YLQK9@mail.missouri.edu

Please check your e-mail; I have also enclosed the sources.

SergeyKostrov · ‎10-28-2012

>>...I did. /Qunroll:100000000 .

That number looks too big and I don't think it would help. I always use 4-in-1. Here is a small example:

[cpp]
// Without Unrolling ( 1-in-1 )
for( int i = 0; i < N; i++ )
{
    a += b[i];
}
[/cpp]

[cpp]
// With Unrolling ( 4-in-1 )
for( int i = 0; i < N; i+=4 )
{
    a += ( b[i] + b[i+1] + b[i+2] + b[i+3] );
}
[/cpp]

Note: You need to verify that N % 4 equals 0; if it does not, the remaining elements require additional processing.
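The "additional processing" for the case where N % 4 is not 0 could look like the sketch below. This is only one possible way to do it, not the code from the enclosed sources: `sum_unrolled4` is a hypothetical name, and the input is assumed to be a contiguous array of `double`.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: sums n doubles with a 4-in-1 unrolled main loop
// and a scalar loop for the 0..3 leftover elements.
double sum_unrolled4(const double* b, std::size_t n)
{
    assert(b != nullptr || n == 0);

    double a = 0.0;
    const std::size_t n4 = n - (n % 4);   // largest multiple of 4 not exceeding n

    std::size_t i = 0;
    for (; i < n4; i += 4) {              // 4-in-1 main loop
        a += (b[i] + b[i + 1] + b[i + 2] + b[i + 3]);
    }
    for (; i < n; ++i) {                  // remainder when n % 4 != 0
        a += b[i];
    }
    return a;
}
```

For example, with 7 elements the main loop handles elements 0 to 3 and the remainder loop handles elements 4 to 6. Because the grouping of additions differs from the 1-in-1 loop, the two versions are not guaranteed to produce bit-identical results for arbitrary data.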

SergeyKostrov · ‎10-31-2012

>>...The precision issue is the most important thing right now...

On the web page

http://www.tddft.org/programs/octopus/wiki/index.php/Manual:Installation

there is a statement:

>>...
>>testsuite/
>>Used to check your build. You may also use the files in here as samples of how to do various types of calculations.

- Do you have some reference data to compare with?
- How did you detect that precision loss?
- Could you post some technical details?

SergeyKostrov · ‎11-07-2012

Hi everybody,

I'd like to share the results of my additional test. I'm very impressed, since a ~15-year-old Borland C++ compiler outperformed all (!) modern C++ compilers in a test case when all optimizations were disabled:

...
>> With Borland C++ compiler <<
>> Non-Deterministic Tests <<
>> Array size: 32MB 'double' elements <<

*** Set of Full Sum Tests ***

Full Sum : Rolled Loops - 1-in-1
Sum is: 0.000000
Calculated in 2110 ticks

Full Sum : UnRolled Loops - 4-in-1 - A
Sum is: 0.000000
Calculated in 2094 ticks

Full Sum : UnRolled Loops - 4-in-1 - B // <= Best Time ( without priority boost )
Sum is: 0.000000
Calculated in 2000 ticks

Full Sum : UnRolled Loops - 8-in-1 - A
Sum is: 0.000000
Calculated in 2671 ticks

Full Sum : UnRolled Loops - 8-in-1 - B
Sum is: 0.000000
Calculated in 2313 ticks

Process Priority High
Full Sum : UnRolled Loops - 4-in-1 - B
Sum is: 0.000000
Calculated in 1984 ticks

Process Priority Realtime
Full Sum : UnRolled Loops - 4-in-1 - B // <= Best Time ( with priority boost )
Sum is: 0.000000
Calculated in 1953 ticks
...