Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

Improve the performance of a sum

FortCpp
Beginner
1,866 Views

Hello Intel,

I am using a scientific calculation code, and I would like to improve it a little if possible. I checked the code with Amplifier. The most time-consuming (most heavily used) code is this:

[cpp]

double a = 0.0;
for(j = 0; j < n; j++) a += w*fi[((index + i)<<ldf) + k];

[/cpp]

To me it is just a dot product between w and fi. I am wondering:

1. Will the Intel compiler do this automatically? (I mean, treat the loop as a dot product of two vectorized arrays.)

2. Is there a way to improve the code? (For example, define another array a1 the same size as w, store all the multiplied numbers in a1 (an unrolled loop?), and do the summation at the end.)

3. Other suggestions?
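A minimal sketch of idea 2 above: store the products in a temporary array, then sum at the end. All names mirror the snippet; I am assuming here that `w` and `index` are arrays indexed by `j` (the thread later describes them as arrays), and whether this beats the fused loop depends on the compiler.

```cpp
#include <cstddef>
#include <vector>

// Sketch of question 2: split the fused multiply-accumulate loop into a
// product pass and a separate summation pass. index/i/ldf/k are placeholders
// mirroring the original index arithmetic; w[j] and index[j] are assumptions.
double split_dot(const double* w, const double* fi,
                 const int* index, int i, int ldf, int k, std::size_t n)
{
    std::vector<double> a1(n);            // holds the individual products
    for (std::size_t j = 0; j < n; ++j)
        a1[j] = w[j] * fi[((index[j] + i) << ldf) + k];

    double a = 0.0;                       // sum the stored products at the end
    for (std::size_t j = 0; j < n; ++j)
        a += a1[j];
    return a;
}
```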

I am using Parallel Composer 2013 with Visual Studio. Any idea will be appreciated! :)

0 Kudos
68 Replies
SergeyKostrov
Valued Contributor II
447 Views
>>...I never did raise priority. And I should try it.

These Win32 API functions should be used:

GetCurrentProcess
SetPriorityClass
GetCurrentThread
SetThreadPriority
0 Kudos
SergeyKostrov
Valued Contributor II
447 Views
>>>>...for(j = 0; j < n; j++) a += w*fi[((index + i) << ldf) + k];
>>>>
>>>>Could you provide some technical details for the i, ldf and k variables?
>>
>>Sorry, it is a bit complicated.

There are lots of smart guys here, and if you really want to get help, some generic details will help significantly. Could you tell us what the value of 'n' is in the 'for' statement?
0 Kudos
jimdempseyatthecove
Honored Contributor III
447 Views
FortCpp wrote:

Could you explain more about "systems supporting gather"? Does Windows 7 64-bit support it? What is "gather"?
In the code the index is computed before the vector product. What is the reason for doing so?

Thanks in advance!

In the next generation of AVX for Haswell there is a new feature whereby the AVX instruction set permits scatter/gather. Consult http://software.intel.com/sites/default/files/m/f/7/c/36945 for the nitty-gritty.

Note that when your system and compiler support AVX gather, the compiler may generate gather instructions directly from the source code, or you can alternatively use intrinsic functions to perform the gather at a lower level.

Consider the AVX instruction for gathering 4 doubles. This instruction specifies a 4-wide (dp) AVX destination register, an array address, a ymm register containing 4 indices into the array, and a ymm register containing a mask (to enable or disable loading the array elements specified by the indices). In the suggestion I made, should you pre-create an array of indices (either all at once, or 4 at a time), your code can perform the loads more efficiently: one fetch of 4 indices, then one instruction to fetch into a 4-wide register from 4 separate locations in an array (there are still 4 fetches). But note that all 4 data elements are now packed into an AVX register without additional instructions, so the vector operation(s) can be performed directly (e.g. sum, dot product, ...).

On systems without AVX gather, pre-building the array of indices may run faster (give it a try; it will be faster to experiment than to write these back-and-forth forum posts). The reason is that the compiler optimizer may better see what you are doing and generate more efficient code. Try to keep the array of indices small enough to fit in the L1 cache, and nest the loop an additional level in the event you exceed the L1 cache size.

Jim Dempsey
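Jim's fallback for systems without hardware gather — pre-building a small, L1-resident array of indices — might look like the following sketch. All names are placeholders mirroring the original snippet, and the block size is an assumption chosen to stay well under L1 capacity.

```cpp
#include <cstddef>

// Sketch of the "pre-build an array of indices" idea: compute the effective
// indices for a small L1-resident block first, then let the compiler see a
// simple indexed reduction over that block. Names are placeholders; BLOCK is
// an illustrative size, not a tuned value.
double blocked_indexed_sum(const double* w, const double* fi,
                           const int* index, int i, int ldf, int k,
                           std::size_t n)
{
    const std::size_t BLOCK = 256;
    int idx[256];                        // small scratch array of indices
    double a = 0.0;
    for (std::size_t base = 0; base < n; base += BLOCK) {
        std::size_t m = (n - base < BLOCK) ? n - base : BLOCK;
        for (std::size_t j = 0; j < m; ++j)      // pass 1: indices only
            idx[j] = ((index[base + j] + i) << ldf) + k;
        for (std::size_t j = 0; j < m; ++j)      // pass 2: gather and sum
            a += w[base + j] * fi[idx[j]];
    }
    return a;
}
```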
0 Kudos
FortCpp
Beginner
447 Views
Sergey Kostrov wrote:

>>...I did. /Qunroll:100000000

Here is a summary from the Intel C++ compiler documentation regarding loop unrolling:

Loop Unrolling

The benefits of loop unrolling are as follows:
- Unrolling eliminates branches and some of the code.
- Unrolling enables you to aggressively schedule (or pipeline) the loop to hide latencies if you have enough free registers to keep variables live.
- For processors based on the IA-32 architectures, the processor can correctly predict the exit branch for an inner loop that has 16 or fewer iterations, if that number of iterations is predictable and there are no conditional branches in the loop. Therefore, if the loop body size is not excessive and the probable number of iterations is known, unroll inner loops for these processors until they have a maximum of 16 iterations.
- A potential limitation is that excessive unrolling, or unrolling of very large loops, can lead to increased code size.

There is an option /Qunroll-aggressive, and here is some summary:

...
On systems using IA-64 architecture, you can only specify a value of 0.
...

and

...
This option determines whether the compiler uses more aggressive unrolling for certain loops. The positive form of the option may improve performance.
On IA-32 architecture and Intel® 64 architecture, this option enables aggressive, complete unrolling for loops with small constant trip counts.
On IA-64 architecture, this option enables additional complete unrolling for loops that have multiple exits or outer loops that have a small constant trip count.
...

There is also an option -funroll-all-loops, and by default it is OFF.

[SergeyK Note] There is a very small performance improvement from changing unrolling from 4-in-1 to 8-in-1, and I always use 4-in-1. Regarding your concerns about unrolling: I would try it before making a final statement.

I set a big number for loop unrolling because I thought the compiler would decide whether and how the unrolling should be done. Maybe I should go with the "unroll all loops" option. You really gave me a lesson on loop unrolling. I thought unrolling just repeats the loop body several times; I didn't know the loop could be unrolled as you explained ( a += ( b[i] + b[i+1] + b[i+2] + b[i+3] ); ). I don't think I will manually unroll the loop. But you are right, I should unroll the loop.
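For reference, the grouped 4-in-1 form discussed here can be sketched as follows. The tail loop is my addition so the length need not be a multiple of 4; this is only an illustration, since the compiler's own unrolling may already do better.

```cpp
#include <cstddef>

// Manual 4-in-1 unroll in the grouped form a += (b[i] + b[i+1] + b[i+2] +
// b[i+3]), plus a scalar remainder loop. A sketch to measure against the
// compiler's automatic unrolling, not a recommendation in itself.
double sum_unrolled4(const double* b, std::size_t n)
{
    double a = 0.0;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4)
        a += (b[i] + b[i + 1] + b[i + 2] + b[i + 3]);
    for (; i < n; ++i)                   // remainder elements
        a += b[i];
    return a;
}
```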
0 Kudos
FortCpp
Beginner
447 Views
Sergey Kostrov wrote:

>>...2. "software prefetch ( as inline assembler instruction ) could improve performance by 0.5 - 1.0%"

I use an inline assembler instruction 'prefetcht0' instead of a call to the intrinsic function '_mm_prefetch' ( a more portable form with some additional call overhead ). In general it looks like:

...
	_asm prefetcht0	[ ptAddress ]
...

I am not sure when to use it. Does it tell the processor to "prefetch" data into cache?
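A hedged sketch of the portable intrinsic form mentioned above: `_mm_prefetch` hints the cache line at a given address into cache ahead of use. The loop shape and prefetch distance here are illustrative assumptions; the hint never changes results and may or may not help, so profile.

```cpp
#include <cstddef>
#include <xmmintrin.h>   // _mm_prefetch / _MM_HINT_T0 (x86 SSE)

// Hint the element AHEAD iterations ahead into L1 while summing the current
// one. AHEAD is a tuning parameter, not a recommended value; on short arrays
// the hint simply never fires.
double sum_with_prefetch(const double* fi, std::size_t n)
{
    const std::size_t AHEAD = 8;
    double a = 0.0;
    for (std::size_t j = 0; j < n; ++j) {
        if (j + AHEAD < n)
            _mm_prefetch(reinterpret_cast<const char*>(fi + j + AHEAD),
                         _MM_HINT_T0);
        a += fi[j];
    }
    return a;
}
```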
0 Kudos
FortCpp
Beginner
447 Views
Sergey Kostrov wrote:

>>...I never did raise priority. And I should try it.

These Win32 API functions should be used:

GetCurrentProcess
SetPriorityClass
GetCurrentThread
SetThreadPriority

OK. I'll take this into account.
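A minimal Windows-only sketch of those four calls (an illustration only: error checking is omitted, and REALTIME_PRIORITY_CLASS is deliberately avoided since it can starve the rest of the system):

```cpp
#include <windows.h>

// Raise the current process and thread priority using the Win32 APIs listed
// above. Sketch only: real code should check the BOOL return values, and
// should restore the original priorities afterwards.
void boost_priority()
{
    SetPriorityClass(GetCurrentProcess(), HIGH_PRIORITY_CLASS);
    SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_ABOVE_NORMAL);
}
```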
0 Kudos
FortCpp
Beginner
447 Views
Sergey Kostrov wrote:

>>>>...for(j = 0; j < n; j++) a += w*fi[((index + i) << ldf) + k];
>>>>
>>>>Could you provide some technical details for the i, ldf and k variables?
>>
>>Sorry, it is a bit complicated.

There are lots of smart guys here, and if you really want to get help, some generic details will help significantly. Could you tell us what the value of 'n' is in the 'for' statement?

True, and I want to tell you more, but it is a code piece from free software. This is the innermost layer of the self-consistent iteration loop, and if you browse the code, you can see the author actually did some smart optimization. When I checked it with Amplifier, I found that the most time-consuming part (except for the MKL routines) is this loop. To get a better picture, I have to track through some functions. It is a mixed Fortran/C program; the C part is written so it can be called from Fortran subroutines, aiming at speeding up the code. I'll try to make it clearer over the weekend or maybe early next week. I am sorry, but it is not that easy for me. Just give me some time and maybe I can post it here. And thanks for your kindness, Sergey.
0 Kudos
jimdempseyatthecove
Honored Contributor III
447 Views
FortCpp wrote:

Hello Intel,

double a = 0.0;
 for(j = 0; j < n; j++) a += w*fi[((index + i)<<ldf) + k];

3. Other suggestions?

How about a small working sample program with dummied-up data that exhibits approximately the same behavior? This may yield better results for you and require less time from the other readers.

Jim Dempsey
0 Kudos
SergeyKostrov
Valued Contributor II
447 Views
>>...It is a fortran C mixed program...

Please take into account that I don't have a Fortran compiler. As far as I know, Jim has one. So a pure C or C++ test case should work for everybody.
0 Kudos
SergeyKostrov
Valued Contributor II
447 Views
Here is a more advanced example:

[cpp]
///////////////////////////////////////////////////////////////////////////////
// Test1143 - Research on Data Prefetching

#if defined ( _RTTEST_1143_INCLUDED )
// #define _RTPREFETCH_DATA( pAddress, iOffset )
// #define _RTPREFETCH_DATA( pAddress, iOffset ) _mm_prefetch( ( ( RTchar * )( pAddress ) ) + iOffset, _MM_HINT_T0 )
// #define _RTPREFETCH_DATA( pAddress ) _asm prefetcht0 [ ##pAddress ]

#define CrtDataPrefetchT0( pAddress )
#define CrtDataPrefetchT1( pAddress )
#define CrtDataPrefetchT2( pAddress )
#define CrtDataPrefetchNTA( pAddress )
// #define CrtDataPrefetchT0( pAddress ) _asm prefetcht0 [ ##pAddress ]
// #define CrtDataPrefetchT1( pAddress ) _asm prefetcht1 [ ##pAddress ]
// #define CrtDataPrefetchT2( pAddress ) _asm prefetcht2 [ ##pAddress ]
// #define CrtDataPrefetchNTA( pAddress ) _asm prefetchnta [ ##pAddress ]

#define _RTPREFETCH_DATASIZE 64
#define _RTVECTOR_SIZE 2048
#endif

#if defined ( _RTTEST_1143_INCLUDED )
RTvoid RunTest1143( RTvoid )
{
	CrtPrintf( g_szMessageTestStart, 1143 );
	g_uiTicksStart = SysGetTickCount();

	// Sub-Test 1
	{
	#if ( defined ( _WIN32_MSC ) || defined ( _WIN32_ICC ) )
		RTint iSize1 = sizeof( __m128 );
		RTint iSize2 = sizeof( __m128d );

		__m128 *pVec1 = ( __m128 * )_mm_malloc( _RTVECTOR_SIZE * sizeof( __m128 ), 16 );
		__m128 *pVec2 = ( __m128 * )_mm_malloc( _RTVECTOR_SIZE * sizeof( __m128 ), 16 );
		__m128 *pVec3 = ( __m128 * )_mm_malloc( _RTVECTOR_SIZE * sizeof( __m128 ), 16 );

		RTint i;
		for( i = 0; i < _RTVECTOR_SIZE; i++ )
		{
			pVec1[i].m128_f32[0] = pVec1[i].m128_f32[1] = pVec1[i].m128_f32[2] = pVec1[i].m128_f32[3] = 2.0f;
			pVec2[i].m128_f32[0] = pVec2[i].m128_f32[1] = pVec2[i].m128_f32[2] = pVec2[i].m128_f32[3] = 2.0f;
			pVec3[i].m128_f32[0] = pVec3[i].m128_f32[1] = pVec3[i].m128_f32[2] = pVec3[i].m128_f32[3] = 0.0f;
		}

		g_uiTicksStart = SysGetTickCount();
		for( RTuint t = 0; t < ( ( RTuint )_RTNUMBER_OF_TESTS * 256 ); t++ )
		{
			CrtDataPrefetchT0( pVec1 );
			CrtDataPrefetchT0( pVec2 );
			CrtDataPrefetchT0( pVec3 );

			for( i = 0; i < _RTVECTOR_SIZE; i += 4 )
			{
				pVec3[i  ] = _mm_mul_ps( pVec1[i  ], pVec2[i  ] );
				pVec3[i+1] = _mm_mul_ps( pVec1[i+1], pVec2[i+1] );
				pVec3[i+2] = _mm_mul_ps( pVec1[i+2], pVec2[i+2] );
				pVec3[i+3] = _mm_mul_ps( pVec1[i+3], pVec2[i+3] );
			}
		}
		CrtPrintf( RTU("SSE-Based Vector Multiplication - %ld ticks\n"),
				   ( RTint )( SysGetTickCount() - g_uiTicksStart ) );

		if( pVec1 != RTnull ) _mm_free( pVec1 );
		if( pVec2 != RTnull ) _mm_free( pVec2 );
		if( pVec3 != RTnull ) _mm_free( pVec3 );
	#endif
	}

	CrtPrintf( g_szMessageTestCompleted, ( RTint )( SysGetTickCount() - g_uiTicksStart ) );
	CrtPrintf( g_szMessageTestEnd, 1143 );
}
#endif
[/cpp]
0 Kudos
SergeyKostrov
Valued Contributor II
447 Views
Here is another tip. I've noticed that:

[cpp]
for( int i = 0; i < N; i+=4 )    // 1st form
{
   a += ( b[i] + b[i+1] + b[i+2] + b[i+3] );
}
[/cpp]

is faster than:

[cpp]
for( int i = 0; i < N; i+=4 )    // 2nd form
{
   a += b[i];
   a += b[i+1];
   a += b[i+2];
   a += b[i+3];
}
[/cpp]
0 Kudos
TimP
Honored Contributor III
447 Views
Sergey Kostrov wrote:

Here is another tip. I've noticed that:


   for( int i = 0; i < N; i+=4 )                         // 1st form
   {
      a += ( b[i] + b[i+1] + b[i+2] + b[i+3] );
   }

is faster than:

   for( int i = 0; i < N; i+=4 )                         // 2nd form
   {
      a += b[i];
      a += b[i+1];
      a += b[i+2];
      a += b[i+3];
   }

If you allow a vectorizing compiler to optimize by "riffling," you shouldn't have to be concerned about this. If you don't allow the compiler to perform such optimizations, the normal icc option /fp:fast can undo your efforts (and degrade accuracy). The point is that when you add values sequentially in scalar mode, even if the result is registerized, you can add only once per 4 clock cycles or so.
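The "riffling" mentioned here — multiple independent accumulators to hide the roughly 4-cycle add latency — can be sketched as follows. Note this reassociates the additions, so the last bits of the result may differ from a plain serial loop, which is exactly why strict floating-point models refuse to do it automatically.

```cpp
#include <cstddef>

// Four independent partial sums break the serial dependency chain on a
// single accumulator, letting additions overlap in the pipeline. The
// association order differs from a plain loop; results can differ in the
// last ulps for general inputs.
double sum_riffled(const double* b, std::size_t n)
{
    double a0 = 0.0, a1 = 0.0, a2 = 0.0, a3 = 0.0;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        a0 += b[i];
        a1 += b[i + 1];
        a2 += b[i + 2];
        a3 += b[i + 3];
    }
    double a = (a0 + a1) + (a2 + a3);
    for (; i < n; ++i)                   // remainder elements
        a += b[i];
    return a;
}
```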
0 Kudos
SergeyKostrov
Valued Contributor II
447 Views
Hi Tim, Thanks for your comments.

>>...If you don't allow the compiler to perform such optimizations...

Yes, I don't allow it.

>>...the normal icc option /fp:fast can undo your efforts (and degrade accuracy).

A Floating Point Model 'Precise' ( /fp:precise ) is always used in all VS projects I have.
0 Kudos
TimP
Honored Contributor III
447 Views
Yes, icc options such as /fp:precise or /fp:source should observe the order of addition specified in the source code, including disabling the optimization of sum reduction. Unlike gcc and ifort, icc has no command-line option for "vectorized" sum reduction while otherwise complying with language standards on evaluation order. This may be one of the reasons for #pragma simd reduction (which overrides /fp:precise in that loop).
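A sketch of the per-loop override TimP describes. `#pragma simd` is icc-specific; the OpenMP form shown here is the closest standard equivalent and is an assumption on my part — it takes effect only with an OpenMP-SIMD-aware compiler, and is otherwise ignored, leaving the same (slower, strictly ordered) sum.

```cpp
#include <cstddef>

// Per-loop vectorized sum reduction: the pragma tells the compiler that
// reassociating the additions into partial sums is allowed for this loop
// only, even under a strict floating-point model. Compilers that do not
// understand OpenMP SIMD simply ignore the pragma.
double sum_simd_reduction(const double* b, std::size_t n)
{
    double a = 0.0;
#pragma omp simd reduction(+ : a)
    for (std::size_t j = 0; j < n; ++j)
        a += b[j];
    return a;
}
```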
0 Kudos
SergeyKostrov
Valued Contributor II
447 Views
>>...Does it tell the processor to "prefetch" data into cache?

Yes. There is a constant flow of posts regarding that subject on the 'Software Tuning, Performance Optimization & Platform Monitoring' forum:

http://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring

A complete description of the instruction is in the 'Instruction Set Reference Manual', volume 2B ( from N to Z ). During the last 12 months I read lots of posts regarding that subject, and there are two groups of developers with different points of view:

- It helps
- It doesn't help

About 9 months ago I did a set of tests, and when a 'prefetcht0' was used it improved performance by 0.5 - 1.0%. However, it depends on the algorithm! In my case it was a Strassen Heap-Based Complete algorithm for matrix multiplication. Since you have VTune, it will be much easier for you to detect whether it helps.

Here is a very simple example:

[cpp]
...
int iArray[64] = { 0 };
int iSum = 0;

_asm prefetcht0 [ iArray ]

for( int i = 0; i < 64; i++ )
   iSum += iArray[i];
...
[/cpp]
0 Kudos
SergeyKostrov
Valued Contributor II
447 Views
Hi 'FortCpp', I've created a simplified test case that simulates your calculations. There is about a 1% performance improvement if the Floating Point Model is changed from 'Precise' ( /fp:precise ) to 'Fast' ( /fp:fast ). This needs to be verified in your software if you decide to use the 'Fast' Floating Point Model.
0 Kudos
SergeyKostrov
Valued Contributor II
447 Views
>>...I don't think I will manually unroll the loop. But you are right, I should unroll the loop...

As you can see, in both cases the test 'UnRolled Loops - 4-in-1 - B' has the best times. More information will be submitted soon.

>> Deterministic Tests <<
>> Array size: 32MB 'double' elements <<

*** Set of Full Sum Tests ***
Full Sum : Rolled Loops - 1-in-1
Sum is: 0.000000
Calculated in 609 ticks
Full Sum : UnRolled Loops - 4-in-1 - A
Sum is: 0.000000
Calculated in 609 ticks
Full Sum : UnRolled Loops - 4-in-1 - B   // <= Best Time ( ~23% faster than 1-in-1 test case )
Sum is: 0.000000
Calculated in 469 ticks
Full Sum : UnRolled Loops - 8-in-1 - A
Sum is: 0.000000
Calculated in 1172 ticks
Full Sum : UnRolled Loops - 8-in-1 - B
Sum is: 0.000000
Calculated in 844 ticks

>> Non-Deterministic Tests <<
>> Array size: 32MB 'double' elements <<

*** Set of Full Sum Tests ***
Full Sum : Rolled Loops - 1-in-1
Sum is: 0.000000
Calculated in 2219 ticks
Full Sum : UnRolled Loops - 4-in-1 - A
Sum is: 0.000000
Calculated in 2203 ticks
Full Sum : UnRolled Loops - 4-in-1 - B   // <= Best Time ( ~6% faster than 1-in-1 test case )
Sum is: 0.000000
Calculated in 2078 ticks
Full Sum : UnRolled Loops - 8-in-1 - A
Sum is: 0.000000
Calculated in 2860 ticks
Full Sum : UnRolled Loops - 8-in-1 - B
Sum is: 0.000000
Calculated in 2687 ticks
0 Kudos
SergeyKostrov
Valued Contributor II
447 Views
Here is additional information.

>>...
>>double a = 0.0;
>>
>>for( j = 0; j < n; j++ )
>>   a += w * fi[ ( (index + i) << ldf ) + k ];
>>
>>To me it is just a dot product between w and fi.
>>...

Access to the 'w' array is sequential. However, access to the 'fi' array could be non-sequential ( could be random ), and it depends on the content of the 'index' array. That is why some time ago I asked a question about the distribution of data in the 'index' array. I've simulated sequential and random accesses to the 'index' array, and in the 2nd case the calculation of the sum was almost 5 times slower.
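The access pattern at issue can be sketched as a simplified form of the loop (the extra i/ldf/k arithmetic is dropped here; names mirror the snippet and `index` is assumed to be an array indexed per iteration). The computed result is identical either way; only the memory access pattern — and hence the cache behavior — changes with the contents of `index[]`.

```cpp
#include <cstddef>

// The same reduction walks 'fi' either sequentially or randomly depending
// solely on the contents of index[]; with random indices each iteration may
// touch a different cache line, which is what makes the random case several
// times slower in the measurements above.
double indexed_sum(const double* w, const double* fi,
                   const int* index, std::size_t n)
{
    double a = 0.0;
    for (std::size_t j = 0; j < n; ++j)
        a += w[j] * fi[index[j]];        // gather pattern driven by index[]
    return a;
}
```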
0 Kudos
SergeyKostrov
Valued Contributor II
447 Views
These are the results of two tests that demonstrate the ranges of values I got when accessing the 'index' array:

*** Set of Full Sum Tests ***

>> Test 1 - Array size: 4096 'double' elements <<
>> Sequential - Deterministic Processing starts <<
Index Value: 0
Index Value: 1
Index Value: 2
Index Value: 3
Index Value: 4
...
Index Value: 1340
Index Value: 1341
Index Value: 1342
Index Value: 1343
Index Value: 1344
...
Index Value: 2690
Index Value: 2691
Index Value: 2692
Index Value: 2693
Index Value: 2694
...
Index Value: 4091
Index Value: 4092
Index Value: 4093
Index Value: 4094
Index Value: 4095
>> Deterministic Processing ends <<

>> Test 2 - Array size: 4096 'double' elements <<
>> Random - Non-Deterministic Processing starts <<
Index Value: 3733
Index Value: 432
Index Value: 2213
Index Value: 1508
Index Value: 2045
...
Index Value: 3841
Index Value: 3760
Index Value: 269
Index Value: 2187
Index Value: 2194
...
Index Value: 1213
Index Value: 781
Index Value: 261
Index Value: 2983
Index Value: 1967
Index Value: 274
...
Index Value: 2171
Index Value: 3864
Index Value: 1573
Index Value: 1214
Index Value: 2577
>> Non-Deterministic Processing ends <<

If the 'index' array is large ( let's say tens of megabytes ) and has random values in it, then values from the 'fi' array will be "taken" from different parts of memory on every iteration.
0 Kudos
SergeyKostrov
Valued Contributor II
444 Views
I verified my own statement:

>>...
>>- raising the priority of the process / thread to Above Normal or High could improve performance by 0.5 - 1.0% ( do not change priority to Real Time )
>>...

There is about a 6% - 8% performance improvement if the process priority is changed to 'High' or 'Realtime' from 'Normal'. Take into account that this needs to be verified in your software if you decide to use a priority boost.

>> Array size: 32MB 'double' elements <<
>> Random - Non-Deterministic Tests <<

*** Set of Full Sum Tests ***
Full Sum : Rolled Loops - 1-in-1
Sum is: 0.000000
Calculated in 2187 ticks
Full Sum : UnRolled Loops - 4-in-1 - A
Sum is: 0.000000
Calculated in 2172 ticks
Full Sum : UnRolled Loops - 4-in-1 - B   // <= Best Time ( without priority boost - ~6% faster than 1-in-1 test case )
Sum is: 0.000000
Calculated in 2062 ticks
Full Sum : UnRolled Loops - 8-in-1 - A
Sum is: 0.000000
Calculated in 2844 ticks
Full Sum : UnRolled Loops - 8-in-1 - B
Sum is: 0.000000
Calculated in 2656 ticks

Process Priority High
Full Sum : UnRolled Loops - 4-in-1 - B
Sum is: 0.000000
Calculated in 2047 ticks

Process Priority Realtime
Full Sum : UnRolled Loops - 4-in-1 - B   // <= Best Time ( with priority boost - ~8% faster than 1-in-1 test case )
Sum is: 0.000000
Calculated in 2016 ticks

Take into account that large amounts of memory need to be pre-allocated before a priority change.
0 Kudos
SergeyKostrov
Valued Contributor II
444 Views
And two more things:

- In my tests I simulated two types of accesses and didn't populate the 'w' and 'fi' arrays with random values. They were initialized with 0.0 values. That is why the sum is always:
...
Sum is: 0.000000
...
- Let me know if you need the source code of my tests.
0 Kudos