Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

Improve the performance of a sum

FortCpp
Beginner

Hello Intel,

I am working with a scientific calculation code, and I want to improve it a little bit if possible. I checked the code with Amplifier. The most time-consuming (most heavily executed) code is this:

[cpp]

double a = 0.0;
for(j = 0; j < n; j++) a += w[j]*fi[((index[j] + i)<<ldf) + k];

[/cpp]

To me it is just a dot product between w and fi. I am wondering:

1. Will the Intel compiler do this automatically? (I mean, treat the loop as the dot product of two vectorized arrays.)

2. Is there a way to improve the code? (I mean, maybe define another array a1 of the same size as w, store all the products in a1 (an unrolled loop?), and do the summation at the end; sketched in code below.)

3. Other suggestions?
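
For question 2, the two-pass version I have in mind would be something like this (an untested sketch; a1 is a scratch array I would add):

[cpp]
// pass 1: store every product in a scratch array a1 (same length as the loop count n)
for(j = 0; j < n; j++) a1[j] = w[j]*fi[((index[j] + i)<<ldf) + k];

// pass 2: sum the stored products
double a = 0.0;
for(j = 0; j < n; j++) a += a1[j];
[/cpp]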

I am using Parallel Composer 2013 with Visual Studio. Any ideas will be appreciated! :)

SergeyKostrov
Valued Contributor II
[EDITED] Sorry for the duplicate of the previous post. All content is deleted.
jimdempseyatthecove
Honored Contributor III
You might consider the following:

[cpp]
double a = 0.0;
int scale = 1 << ldf;
int offset = (i << ldf) + k; // parentheses needed: << binds tighter than + here
// assuming the stack is large enough...
// use alloca to allocate a 16-byte aligned array
int* reindex = (int*)(((intptr_t)alloca(n*sizeof(int)+15)+15)&(~(intptr_t)15));
// assure that your index array is 16-byte aligned
_ASSERT((((intptr_t)index) & 15) == 0);
#pragma simd
for(int j = 0; j < n; j++)
    reindex[j] = index[j] * scale + offset;
// Since the alignment of w and the access pattern of fi[reindex[j]] are unknown,
// we cannot use #pragma simd here.
// However, on compilers and processors supporting scatter/gather
// the following loop can be vectorized.
// On compilers (and processors) not supporting scatter/gather
// the following loop is unrollable and may be partially vectorized.
for(int j = 0; j < n; j++)
    a += w[j]*fi[reindex[j]];
[/cpp]

Jim Dempsey
(down with html comment edit boxes)
SergeyKostrov
Valued Contributor II
Jim suggested you create an index array, like:

>>// construct indexes
>>for(j = 0; j < n; j++) fi_index[j] = ((index[j] + i) << ldf) + k;
>>
>>// use indexes
>>for(j = 0; j < n; j++) a += w[j]*fi[fi_index[j]];

I didn't try to test it due to lack of time, but it is clear that it will improve performance. Overall, if all our proposals are combined, I expect a final performance improvement of ~10% to 15%. It would be nice to hear from you what the total calculation time is for the non-optimized version of that free software. By the way, what software are you talking about?
SergeyKostrov
Valued Contributor II
>>...it is clear that it will improve performance.

I've verified it in a set of my test cases for 16MB and 32MB arrays, and the results are as follows:

- The improvement was about 4% to 6%, depending on the sizes of the arrays ( so, it really works! ):

>>...
>>Full Sum : UnRolled Loops - 4-in-1 - B // <= Best Time ( without priority boost - ~6% faster than 1-in-1 test-case )
>>Sum is: 0.000000
>>Calculated in 2062 ticks
>>...

- It also outperformed a test case with a priority boost to Realtime.
- On a 32-bit Windows platform the total amount of allocated memory was about 1.3GB ( the Virtual Memory manager was busy, but that is not applicable in your case ).
- The indexing solution can be used if 'i', 'ldf' and 'k' are used like "constants"; if they are not, reindexing will be needed.

>>...In overall, if all our proposals are combined I expect final performance improvement by ~10% to 15%...

I now think an increase in performance greater than 15% could be achieved.
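
In outline, a "4-in-1" unrolled sum looks like this ( a simplified sketch of the idea, not my exact test code; it assumes 'n' is a multiple of 4 ):

[cpp]
// Four independent partial sums break the serial dependency chain
// of a single running accumulator.
double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
for( j = 0; j < n; j += 4 )
{
    s0 += w[j  ]*fi[ fi_index[j  ] ];
    s1 += w[j+1]*fi[ fi_index[j+1] ];
    s2 += w[j+2]*fi[ fi_index[j+2] ];
    s3 += w[j+3]*fi[ fi_index[j+3] ];
}
double a = ( s0 + s1 ) + ( s2 + s3 );
[/cpp]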
FortCpp
Beginner
Thanks a lot, Sergey, for all your updates. I was busy this week with a precision issue. Thanks TimP and Jim as well.

I tried to make a piece of code out of the source. I still may not have a better description, but I think I can provide more information about the loop. What I did is follow the loop in debug mode. Here is the story. After many calls, I came to a subroutine whose purpose is to "apply a Hamiltonian (matrix) to a vector". As I followed the subroutines, I came to here:

[fortran]
call operate_ri_vec(op%stencil%size, wre(1), nri_loc, ri(1, ini), imin(ini), imax(ini), pfi(1), LOGLDF, pfo(1))
[/fortran]

This is Fortran code, but it calls a C function. The declaration of the C function is:

[cpp]
void operate_ri_vec_ (const int * opn, const double * restrict w, const int * opnri, const int * opri,
                      const int * rimap_inv, const int * rimap_inv_max, const double * restrict fi,
                      const int * ldfp, double * restrict fo){
[/cpp]

I believe the Fortran arguments (except int, double, and char) are passed as addresses. So op%stencil%size (a member of a structure) is a number, whose value is 7. wre is a 1D integer array. nri_loc = 11722. pfi and pfo are 1D double arrays. ri, imin and imax are integer arrays. LOGLDF is a macro defined to 1.

Then I moved on to follow the function:

[cpp]
const int ldf = ldfp[0];
[/cpp]

In the function, I think this tells me that ldf is 1 and it is a const. And then:

[cpp]
for (l = 0; l < 11722 ; l++) {
    index = opri + 7 * l;
    ......
[/cpp]

I've replaced some of the variables with their values. opri is another address. The loop that I mentioned earlier is actually inside the loop above, if I am correct:

[cpp]
double a = 0.0;
for(j = 0; j < n; j++) a += w[j]*fi[((index[j] + i)<<ldf) + k];
[/cpp]

Since ldf is 1, ((index[j] + i)<<ldf) is just index[j]*2 + i*2. So when the code is running, all the numbers involved in the vector-vector product (sum) are scattered around memory. I don't have a better picture of it so far. A proper test case would be something like this (sketched in code below):

a is a double array with 1,000 elements. b is another double array with 1,000,000 elements. c is an integer array with 1,000 random numbers (between 0 and 999,999), sorted, with all elements different from each other. c is the index into b (a mask). The whole problem is a vector product, but with a mask.

The code is an open-source project and you can find it here: http://www.tddft.org/trac/octopus/browser/trunk/src/grid/operate_inc.c It is already nicely optimized. They mention AVX; I have no idea about it.
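
In code, that test case would be something like this ( an untested sketch; the names and the way I generate the sorted, distinct indices are mine ):

[cpp]
#include <stdio.h>
#include <stdlib.h>

#define NA 1000      /* elements in a and c */
#define NB 1000000   /* elements in b       */

int main(void)
{
    double *a = (double*)malloc(NA * sizeof(double));
    double *b = (double*)malloc(NB * sizeof(double));
    int    *c = (int*)   malloc(NA * sizeof(int));
    int     j;
    double  dot = 0.0;

    /* fill a and b with something simple */
    for(j = 0; j < NA; j++) a[j] = 1.0 / (j + 1);
    for(j = 0; j < NB; j++) b[j] = (double)j;

    /* one random pick per 1000-wide bin gives sorted, distinct indices */
    for(j = 0; j < NA; j++) c[j] = j * (NB / NA) + rand() % (NB / NA);

    /* the masked dot product: c scatters the reads across b */
    for(j = 0; j < NA; j++) dot += a[j] * b[c[j]];

    printf("dot = %f\n", dot);
    free(a); free(b); free(c);
    return 0;
}
[/cpp]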
FortCpp
Beginner
Hi Sergey and TimP, I am catching up, and I saw you had some discussion regarding /fp:precise. Since I have an accuracy issue in my calculations, could you educate me more about it? Also, when I benchmarked the same code compiled on different platforms (Win64 vs. Ubuntu 64-bit), the results looked exactly the same at the beginning of the iterations, but later on they drifted slightly apart from each other. Could such an issue be related to the floating-point model?
FortCpp
Beginner
Sergey Kostrov wrote:

>>...I don't think I will manually unroll the loop. But you are right, I should unroll the loop...

As you can see in both cases test 'UnRolled Loops - 4-in-1 - B' has the best times. More information will be submitted soon.

>> Deterministic Tests <<
>> Array size: 32MB 'double' elements <<

*** Set of Full Sum Tests ***
Full Sum : Rolled Loops - 1-in-1
Sum is: 0.000000
Calculated in 609 ticks

Full Sum : UnRolled Loops - 4-in-1 - A
Sum is: 0.000000
Calculated in 609 ticks

Full Sum : UnRolled Loops - 4-in-1 - B // <= Best Time ( ~23% faster than 1-in-1 test case )
Sum is: 0.000000
Calculated in 469 ticks

Full Sum : UnRolled Loops - 8-in-1 - A
Sum is: 0.000000
Calculated in 1172 ticks

Full Sum : UnRolled Loops - 8-in-1 - B
Sum is: 0.000000
Calculated in 844 ticks

>> Non-Deterministic Tests <<
>> Array size: 32MB 'double' elements <<

*** Set of Full Sum Tests ***
Full Sum : Rolled Loops - 1-in-1
Sum is: 0.000000
Calculated in 2219 ticks

Full Sum : UnRolled Loops - 4-in-1 - A
Sum is: 0.000000
Calculated in 2203 ticks

Full Sum : UnRolled Loops - 4-in-1 - B // <= Best Time ( ~6% faster than 1-in-1 test case )
Sum is: 0.000000
Calculated in 2078 ticks

Full Sum : UnRolled Loops - 8-in-1 - A
Sum is: 0.000000
Calculated in 2860 ticks

Full Sum : UnRolled Loops - 8-in-1 - B
Sum is: 0.000000
Calculated in 2687 ticks

I'll apply it. So you used /fp:precise here as well?
FortCpp
Beginner
Sergey Kostrov wrote:

And two more things:

- In my tests I've simulated two types of accesses and didn't populate 'w' and 'fi' arrays with some random values. They were initialized with 0.0 values. That is why the sum is always:
...
Sum is: 0.000000
...

- Let me know if you need source codes of my tests

That would be very nice! Can you send me your source? YLQK9@mail.missouri.edu
SergeyKostrov
Valued Contributor II
>>...
>>289 #if VEC_SIZE > 1
>>290 for (; i < rimap_inv_max; i++){
>>291   int k;
>>292   for(k = 0; k < (1<<ldf); k++){
>>293     double a = 0.0;
>>294     for(j = 0; j < n; j++) a += w[j]*fi[((index[j] + i)<<ldf) + k];
>>295     fo[(i<<ldf) + k] = a;
>>296   }
>>297 }
>>298 #endif

I have two questions: Do you know the starting value for 'i', and the values of the 'rimap_inv_max' and 'n' variables? I would also like to confirm that 'ldf' = 1, correct? I'll go through your last 3 posts tomorrow.
jimdempseyatthecove
Honored Contributor III
>>The code is open source project and you can find it here: http://www.tddft.org/trac/octopus/browser/trunk/src/grid/operate_inc.c

Not to be overly hard on the programmer/author, but the code is rather sophomoric. First off, the number of available SSE and AVX registers is 16. Xeon Phi (AKA Knights Corner), which currently is a co-processor, has 32 AVX2-extended registers (64 bytes / 8 doubles / 16 floats). So currently, unless your code is running on a Xeon Phi co-processor, the "#if DEPTH >= 32" section is likely counterproductive.

Secondly, the programmer is computing byte offsets, "LOAD(fi + indexj + 1*VEC_SIZE + k)". The compiler optimizations are often confused by these tactics. A better method is (untested code):

[cpp]
#if DEPTH >= 32
for (; i < (rimap_inv_max - unroll32 + 1) ; i += unroll32) {
  int k;
  for(k = 0; k < (1 << ldf); k += 32*VEC_SIZE) {
    register VEC_TYPE a0, a1, a2, a3;
    register VEC_TYPE a4, a5, a6, a7;
    register VEC_TYPE a8, a9, aa, ab;
    register VEC_TYPE ac, ad, ae, af;
    register VEC_TYPE b0, b1, b2, b3;
    register VEC_TYPE b4, b5, b6, b7;
    register VEC_TYPE b8, b9, ba, bb;
    register VEC_TYPE bc, bd, be, bf;
    a0 = a1 = a2 = a3 = VEC_ZERO;
    a4 = a5 = a6 = a7 = VEC_ZERO;
    a8 = a9 = aa = ab = VEC_ZERO;
    ac = ad = ae = af = VEC_ZERO;
    b0 = b1 = b2 = b3 = VEC_ZERO;
    b4 = b5 = b6 = b7 = VEC_ZERO;
    b8 = b9 = ba = bb = VEC_ZERO;
    bc = bd = be = bf = VEC_ZERO;
    for(j = 0; j < n; j++) {
      register VEC_TYPE wj = VEC_SCAL(w[j]);
      int indexj = (index[j] + i) << ldf;
      VEC_TYPE* fiVec = (VEC_TYPE*)(fi + indexj + k);
      a0 = VEC_FMA(wj, LOAD(fiVec     ), a0);
      a1 = VEC_FMA(wj, LOAD(fiVec +  1), a1);
      a2 = VEC_FMA(wj, LOAD(fiVec +  2), a2);
      a3 = VEC_FMA(wj, LOAD(fiVec +  3), a3);
      a4 = VEC_FMA(wj, LOAD(fiVec +  4), a4);
      a5 = VEC_FMA(wj, LOAD(fiVec +  5), a5);
      a6 = VEC_FMA(wj, LOAD(fiVec +  6), a6);
      a7 = VEC_FMA(wj, LOAD(fiVec +  7), a7);
      a8 = VEC_FMA(wj, LOAD(fiVec +  8), a8);
      a9 = VEC_FMA(wj, LOAD(fiVec +  9), a9);
      aa = VEC_FMA(wj, LOAD(fiVec + 10), aa);
      ab = VEC_FMA(wj, LOAD(fiVec + 11), ab);
      ac = VEC_FMA(wj, LOAD(fiVec + 12), ac);
      ad = VEC_FMA(wj, LOAD(fiVec + 13), ad);
      ae = VEC_FMA(wj, LOAD(fiVec + 14), ae);
      af = VEC_FMA(wj, LOAD(fiVec + 15), af);
      b0 = VEC_FMA(wj, LOAD(fiVec + 16), b0);
      b1 = VEC_FMA(wj, LOAD(fiVec + 17), b1);
      b2 = VEC_FMA(wj, LOAD(fiVec + 18), b2);
      b3 = VEC_FMA(wj, LOAD(fiVec + 19), b3);
      b4 = VEC_FMA(wj, LOAD(fiVec + 20), b4);
      b5 = VEC_FMA(wj, LOAD(fiVec + 21), b5);
      b6 = VEC_FMA(wj, LOAD(fiVec + 22), b6);
      b7 = VEC_FMA(wj, LOAD(fiVec + 23), b7);
      b8 = VEC_FMA(wj, LOAD(fiVec + 24), b8);
      b9 = VEC_FMA(wj, LOAD(fiVec + 25), b9);
      ba = VEC_FMA(wj, LOAD(fiVec + 26), ba);
      bb = VEC_FMA(wj, LOAD(fiVec + 27), bb);
      bc = VEC_FMA(wj, LOAD(fiVec + 28), bc);
      bd = VEC_FMA(wj, LOAD(fiVec + 29), bd);
      be = VEC_FMA(wj, LOAD(fiVec + 30), be);
      bf = VEC_FMA(wj, LOAD(fiVec + 31), bf);
    }
    VEC_TYPE* foVec = (VEC_TYPE*)(fo + (i << ldf) + k);
    STORE(foVec     , a0);
    STORE(foVec +  1, a1);
    STORE(foVec +  2, a2);
    STORE(foVec +  3, a3);
    STORE(foVec +  4, a4);
    STORE(foVec +  5, a5);
    STORE(foVec +  6, a6);
    STORE(foVec +  7, a7);
    STORE(foVec +  8, a8);
    STORE(foVec +  9, a9);
    STORE(foVec + 10, aa);
    STORE(foVec + 11, ab);
    STORE(foVec + 12, ac);
    STORE(foVec + 13, ad);
    STORE(foVec + 14, ae);
    STORE(foVec + 15, af);
    STORE(foVec + 16, b0);
    STORE(foVec + 17, b1);
    STORE(foVec + 18, b2);
    STORE(foVec + 19, b3);
    STORE(foVec + 20, b4);
    STORE(foVec + 21, b5);
    STORE(foVec + 22, b6);
    STORE(foVec + 23, b7);
    STORE(foVec + 24, b8);
    STORE(foVec + 25, b9);
    STORE(foVec + 26, ba);
    STORE(foVec + 27, bb);
    STORE(foVec + 28, bc);
    STORE(foVec + 29, bd);
    STORE(foVec + 30, be);
    STORE(foVec + 31, bf);
  }
}
#endif
[/cpp]

Jim Dempsey
jimdempseyatthecove
Honored Contributor III
GRIPE: One cannot paste .cpp code into [cpp]paste here[/cpp] without getting extra line breaks. Please fix. Also note that the C/C++ left-shift operator ("<<") does not paste in this block either. Please fix. Jim Dempsey
SergeyKostrov
Valued Contributor II
Hi Jim, It is hard to see your optimization without comparing it to the original code. Do you mean the addition of the 'indexj' variable?

...
049 int indexj = (index[j] + i) << ldf;
051 VEC_TYPE* fiVec = (VEC_TYPE*)(fi + indexj + k);
...
jimdempseyatthecove
Honored Contributor III
The recommended edits replace

LOAD(fi + indexj + n*VEC_SIZE + k)

with

LOAD(fiVec + n)

where n is sequenced from 0:31, and fiVec is a VEC_TYPE* as opposed to the original fi, which is an FPTYPE* (double* or float*). The compiler has optimizations specifically built for pointer, pointer+1, pointer+2, ... The original code "should" optimize to the same, but this is not always the case. To verify, run the test (about 10 minutes of work). Jim Dempsey
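
Side by side, the change is ( illustrative, using the macro names from the octopus source ):

[cpp]
// before: element-offset arithmetic on fi, an FPTYPE* (double* or float*)
a1 = VEC_FMA(wj, LOAD(fi + indexj + 1*VEC_SIZE + k), a1);

// after: pointer + small constant on a VEC_TYPE*
VEC_TYPE* fiVec = (VEC_TYPE*)(fi + indexj + k);
a1 = VEC_FMA(wj, LOAD(fiVec + 1), a1);
[/cpp]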
FortCpp
Beginner
jimdempseyatthecove wrote:

You might consider the following:


double a = 0.0;
int scale = 1 << ldf;
int offset = (i << ldf) + k;
// assuming stack large enough...
// use alloca to allocate a 16-byte aligned array
int* reindex = (int*)(((intptr_t)alloca(n*sizeof(int)+15)+15)&(~(intptr_t)15));
// assure that your index array is 16-byte aligned
_ASSERT((((intptr_t)index) & 15) == 0);
#pragma simd
for(int j = 0; j < n; j++)
reindex[j] = index[j] * scale + offset;
// since w alignment is unknown and the fi[reindex[j]] pattern is unknown
// we cannot use #pragma simd here
// however, on compilers and processors supporting scatter/gather
// the following loop can be vectorized.
// On compilers (and processors) not supporting scatter/gather
// the following loop is unrollable and may be partially vectorized.
for(int j = 0; j < n; j++)
a += w[j]*fi[reindex[j]];

Jim Dempsey
(down with html comment edit boxes)

Thanks Jim. I just took your code and applied it, but the improvement is not significant. I think your idea is to calculate the indexes before the vector product is done. I don't know why it doesn't help.
jimdempseyatthecove
Honored Contributor III
When your indexes are totally random, precomputing a table of indexes _may_ be faster on systems without gather. When you upgrade your processor to one that supports gather and recompile for gather, the same code should be faster: you are future-proofing your code. When you have a random index combined with a series of sequential references (index, index+1, index+2, ...), it may be best to compute a reference/pointer to the indexed location (p) and then use p[0], p[1], p[2], ... The code looks cleaner, and it may run faster. Jim Dempsey
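
Against the loop in this thread, that pointer style would look roughly like this ( a sketch; it assumes the k values address consecutive elements of fi ):

[cpp]
// one gathered base address per j, then simple sequential accesses from it
const double* p = &fi[(index[j] + i) << ldf];
a0 += w[j] * p[0];
a1 += w[j] * p[1];
a2 += w[j] * p[2];
[/cpp]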
SergeyKostrov
Valued Contributor II
Here are a couple of follow-ups.

>>...
>>Full Sum : UnRolled Loops - 4-in-1 - B // <= Best Time ( ~6% faster than 1-in-1 test case )
>>Sum is: 0.000000
>>Calculated in 2078 ticks
>>
>>Full Sum : UnRolled Loops - 8-in-1 - A
>>Sum is: 0.000000
>>Calculated in 2860 ticks
>>...
>>
>>I'll apply it. So you used /fp:precise here as well?

Yes.
SergeyKostrov
Valued Contributor II
>>...I saw you had some discussion regarding /fp:precise. Since I have an accuracy issue in my calculations, could you educate me
>>more about it?

Please take a look at these topics in MSDN:

- /fp (Specify Floating-Point Behavior)
- Floating Point Numbers

Ways to control the floating-point model ( modes ) are as follows:

- C++ compiler options /fp:[ precise | except | fast | strict ].
- The Precise, Except, Fast and Strict floating-point models can be controlled in source code using the 'float_control' pragma directive:

#pragma float_control ( precise, on )
#pragma float_control ( except, off )
#pragma float_control ( fast, off )
#pragma float_control ( strict, off )

- A set of CRT functions, '_control87', '_controlfp' and '__control87_2', allows you to change floating-point modes, like precision, rounding and infinity. You could also look at the '_status87' and '_fpreset' CRT functions. When I use '_control87' I always select default modes.
- The use of SSE instructions in CRT functions, like sin and cos, can be controlled with the '_set_SSE2_enable' CRT function.
- Take into account that some of these features or functions are Microsoft / Intel C++ compiler specific.
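
For example, a minimal sketch of inspecting and setting the CRT floating-point control word with '_controlfp_s' ( note, per Tim's reply below, these functions affect the x87 state and do not apply to code compiled for SSE2/x64 ):

[cpp]
#include <float.h>
#include <stdio.h>

int main( void )
{
    unsigned int ctrl = 0;

    // newControl = 0 with mask = 0 reads the control word without changing it
    _controlfp_s( &ctrl, 0, 0 );
    printf( "control word : 0x%08x\n", ctrl );
    printf( "rounding bits: 0x%08x\n", ctrl & _MCW_RC );

    // example: select round-to-nearest explicitly
    _controlfp_s( &ctrl, _RC_NEAR, _MCW_RC );

    return 0;
}
[/cpp]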
SergeyKostrov
Valued Contributor II
>>...the results looked exactly the same at the beginning of the iterations, but later on they drifted slightly apart from each other.
>>Could such an issue be related to the floating-point model?

I think yes, and it is possibly related to different rounding modes. Could you post a small example?
SergeyKostrov
Valued Contributor II
Remember that, due to the limitations of the IEEE 754 Standard, rounding errors can accumulate very quickly! Please take a look at a very simple example. Let's say two square 8x8 matrices A and B need to be multiplied.

// Matrix A - 8x8 - 'float' type:
 101.0  201.0  301.0  401.0  501.0  601.0  701.0  801.0
 901.0 1001.0 1101.0 1201.0 1301.0 1401.0 1501.0 1601.0
1701.0 1801.0 1901.0 2001.0 2101.0 2201.0 2301.0 2401.0
2501.0 2601.0 2701.0 2801.0 2901.0 3001.0 3101.0 3201.0
3301.0 3401.0 3501.0 3601.0 3701.0 3801.0 3901.0 4001.0
4101.0 4201.0 4301.0 4401.0 4501.0 4601.0 4701.0 4801.0
4901.0 5001.0 5101.0 5201.0 5301.0 5401.0 5501.0 5601.0
5701.0 5801.0 5901.0 6001.0 6101.0 6201.0 6301.0 6401.0

// Matrix B - 8x8 - 'float' type ( identical to Matrix A above )

// Matrix C = Matrix A * Matrix B - 8x8 - 'float' type:
  13826808.0   14187608.0   14548408.0   14909208.0   15270008.0   15630808.0   15991608.0   16352408.0
  32393208.0   33394008.0   34394808.0   35395608.0   36396408.0   37397208.0   38398008.0   39398808.0
>50959604.0<  52600404.0   54241204.0   55882004.0   57522804.0   59163604.0   60804404.0   62445204.0
  69526008.0   71806808.0   74087608.0   76368408.0   78649208.0   80930008.0   83210808.0   85491608.0
  88092408.0   91013208.0   93934008.0   96854808.0   99775608.0  102696408.0  105617208.0  108538008.0
 106658808.0  110219608.0  113780408.0  117341208.0  120902008.0  124462808.0  128023608.0  131584408.0
 125225208.0  129426008.0  133626808.0  137827616.0  142028400.0  146229216.0  150430000.0  154630816.0
 143791600.0  148632416.0  153473200.0  158314016.0  163154800.0  167995616.0  172836416.0  177677200.0

I marked one of the several incorrect values as '>50959604.0<'; due to a rounding error it is NOT equal to '50959608.0'. If you change the 'float' type to 'double', the result of the matrix multiplication will be correct. For ALL values of matrix C the last two digits must be '08'. They can't be '...00', '...04' or '...16'.
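
A minimal sketch reproducing the marked element ( row 2 of A times column 0 of B ) with 'float' versus 'double' accumulation:

[cpp]
#include <stdio.h>

int main( void )
{
    // row 2 of matrix A and column 0 of matrix B from the example above
    double row[8] = { 1701.0, 1801.0, 1901.0, 2001.0, 2101.0, 2201.0, 2301.0, 2401.0 };
    double col[8] = {  101.0,  901.0, 1701.0, 2501.0, 3301.0, 4101.0, 4901.0, 5701.0 };

    float  sf = 0.0f;
    double sd = 0.0;
    int    i;

    for( i = 0; i < 8; i++ )
    {
        sf += ( float )row[i] * ( float )col[i];   // partial sums round to 24-bit precision
        sd += row[i] * col[i];
    }

    printf( "float  sum: %.1f\n", ( double )sf );  // 50959604.0 here ( SSE2, round-to-nearest )
    printf( "double sum: %.1f\n", sd );            // 50959608.0 - the exact value
    return 0;
}
[/cpp]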
SergeyKostrov
Valued Contributor II
I always recommend reading Intel's article on this subject: 'Consistency of Floating-Point Results using the Intel(R) Compiler' by Dr. Martyn J. Corden and David Kreitzer. I can upload it if you aren't able to find the article.
TimP
Honored Contributor III
>>- A set of CRT functions, '_control87', '_controlfp' and '__control87_2', allows you to change floating-point modes, like precision,
>>rounding and infinity...

The _control87 family of functions has no effect on compilations in x64 mode or with /arch:SSE2 or AVX. I think Sergey means use of x86/IA-32 mode without /arch: when he mentions using them with default options. Intel compilers have /fp:source as a rough equivalent of Microsoft's /fp:fast. Yet the Intel compilers make their more aggressive /fp:fast the default, while you must set /fp:fast to see competitive optimizations with MSVC (particularly in 32-bit mode). Intel Fortran interprets /fp:precise to be the same as /fp:source, but they are different for Intel C++. I used to think the differences between the Intel and MSVC interpretations were partly explained by auto-vectorization, but now VS2012 has some auto-vectorization (apparently not including vector reductions).
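
To make the flag comparison concrete ( illustrative command lines for the Windows compilers, following the equivalences described above ):

icl /fp:source main.cpp     (Intel C++; roughly comparable to MSVC's /fp:fast)
icl            main.cpp     (Intel C++ default is the more aggressive /fp:fast)
cl  /fp:fast   main.cpp     (MSVC; needed for competitive optimizations, especially 32-bit)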