SergeyKostrov
Valued Contributor II

Notes about Loop-Blocking Optimization Technique to increase performance of processing

[ Note 1 ]

The Loop-Blocking Optimization Technique is well described in the Intel Software Development Manual and in the Intel C++ Compiler User and Reference Guides. After extensive testing I can say that it is very important to select the right block size for the innermost for-loop, and that the optimal size depends on the L1 cache line size of the CPU.
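A minimal sketch of the technique as applied to the "add array B to array A" sub-test below ( function name, BLOCK_SIZE of 16 single-precision elements, and the 64-byte cache-line assumption are illustrative, not taken from the original test code ):

```cpp
#include <cstddef>

// Minimal sketch of loop blocking for "add array B to array A".
// BLOCK_SIZE of 16 single-precision elements matches one 64-byte
// cache line; names and tail handling are illustrative.
constexpr std::size_t BLOCK_SIZE = 16;

void blockedAdd(float* a, const float* b, std::size_t n)
{
    for (std::size_t i = 0; i < n; i += BLOCK_SIZE)
    {
        // Process one cache-line-sized block completely before moving on.
        const std::size_t end = (i + BLOCK_SIZE < n) ? (i + BLOCK_SIZE) : n;
        for (std::size_t j = i; j < end; ++j)
            a[j] += b[j];
    }
}
```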

 

SergeyKostrov
Valued Contributor II

[ Note 1 - part 2 ] Here are some performance numbers:

Sub-Test - Adds array B to array A - Loop-Blocking Optimization Technique ( Single-Precision Floating-Point type )

[ Block size 16 elements ( 64 bytes ) ]
ICC - Block Size: 16 - Sub-Test completed in 1516 ticks ( T1 )
MSC - Block Size: 16 - Sub-Test completed in 1531 ticks ( T2 )
MGW - Block Size: 16 - Sub-Test completed in 1546 ticks ( T3 )

[ Block size 32 elements ( 128 bytes ) ]
ICC - Block Size: 32 - Sub-Test completed in 6422 ticks
MSC - Block Size: 32 - Sub-Test completed in 6671 ticks
MGW - Block Size: 32 - Sub-Test completed in 6734 ticks

[ Block size 64 elements ( 256 bytes ) ]
ICC - Block Size: 64 - Sub-Test completed in 6593 ticks
MSC - Block Size: 64 - Sub-Test completed in 6735 ticks
MGW - Block Size: 64 - Sub-Test completed in 6765 ticks
SergeyKostrov
Valued Contributor II

[ Note 2 ] Unrolling ( vectorizing ) the last for-loop as 4-in-1 improves performance by ~2.3% on average across the three C++ compilers I tested:

[ Block size 16 elements ( 64 bytes ) ]
ICC - Block Size: 16 - Sub-Test completed in 1500 ticks - faster by ~1% compared to T1 ( see previous post for all Tx values )
MSC - Block Size: 16 - Sub-Test completed in 1516 ticks - faster by ~1% compared to T2
MGW - Block Size: 16 - Sub-Test completed in 1469 ticks - faster by ~5% compared to T3

[ Note 3 ] The Loop-Blocking Optimization Technique may not improve performance for some processing of very large data sets, for example when they occupy several GBs once loaded into memory and Virtual Memory ( the paging file ) is used.

[ Note 4 ] Abbreviations:
ICC - Intel C++ compiler
MSC - Microsoft C++ compiler
MGW - MinGW C++ compiler

[ Note 5 ] Optimizations for speed ( /O2 ) were used for all C++ compilers.
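The 4-in-1 unrolling of the last for-loop mentioned in Note 2 might look like the sketch below ( names are illustrative, and it assumes the array length is a multiple of the block size ):

```cpp
#include <cstddef>

// 4-in-1 unrolling of the innermost loop of the blocked add. Assumes n
// is a multiple of BLOCK ( 16 ), which is itself a multiple of the
// unroll factor 4; names are illustrative.
void blockedAddUnrolled(float* a, const float* b, std::size_t n)
{
    constexpr std::size_t BLOCK = 16;
    for (std::size_t i = 0; i < n; i += BLOCK)
    {
        for (std::size_t j = i; j < i + BLOCK; j += 4)
        {
            // Four independent adds per iteration reduce loop overhead
            // and map naturally onto a 128-bit SSE register ( 4 floats ).
            a[j    ] += b[j    ];
            a[j + 1] += b[j + 1];
            a[j + 2] += b[j + 2];
            a[j + 3] += b[j + 3];
        }
    }
}
```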
SergeyKostrov
Valued Contributor II

[ Note 6 ] There is a very good article about the Loop-Blocking Optimization Technique at:
http://software.intel.com/en-us/articles/performance-tools-for-software-developers-loop-blocking

[ Note 7 ] Here is another set of performance numbers ( with Intel C++ compiler ):

Sub-Test - MatrixA + MatrixB - Loop-Blocking Optimization Technique
Matrix Size: 4096x4096

Block Size :   2 elements (   8 bytes ) - Sub-Test completed in 1265 ticks
Block Size :   4 elements (  16 bytes ) - Sub-Test completed in  625 ticks
Block Size :   8 elements (  32 bytes ) - Sub-Test completed in  485 ticks
Block Size :  16 elements (  64 bytes ) - Sub-Test completed in  469 ticks <- Best Time
Block Size :  32 elements ( 128 bytes ) - Sub-Test completed in 1678 ticks
Block Size :  64 elements ( 256 bytes ) - Sub-Test completed in 1682 ticks
Block Size : 128 elements ( 512 bytes ) - Sub-Test completed in 1687 ticks
Bernard
Black Belt

By Loop-Blocking Optimization Technique, do you mean dividing the data into cache-line-sized blocks ( 32 bytes long ) and iterating the inner loop over every double or float value inside a cache line?

Here is an example:

void arrayAdditionTest2(double (*input)[MAX_SIZE], double (*output)[MAX_SIZE])
{
    /* Static: a MAX_SIZE x MAX_SIZE array of doubles is too large for the stack. */
    static double result[MAX_SIZE][MAX_SIZE];
    double (*res)[MAX_SIZE] = result;

    if (input == NULL || output == NULL)
        return;

    for (int i = 0; i < MAX_SIZE; i++) {
        for (int j = 0; j < MAX_SIZE; j++) {
            printf("array input[] = %.17f \n", input[i][j]);
        }
    }

    /* Blocked addition: process CACHE_LINE x CACHE_LINE tiles. */
    for (int i = 0; i < MAX_SIZE; i += CACHE_LINE) {
        for (int j = 0; j < MAX_SIZE; j += CACHE_LINE) {
            for (int ii = i; ii < i + CACHE_LINE; ii++) {
                for (int jj = j; jj < j + CACHE_LINE; jj++) {
                    res[ii][jj] = output[ii][jj] + input[ii][jj];
                    printf("Loop Blocking test2 = %.17f \n", res[ii][jj]);
                }
            }
        }
    }
}

SergeyKostrov
Valued Contributor II

See Note 1 and Note 6.
Bernard
Black Belt

Thanks.

SergeyKostrov
Valued Contributor II

[ Note 8 ] There are some recommendations to use the -fomit-frame-pointer and -fprefetch-loop-arrays command line options with the MinGW C++ compiler in order to improve performance. However, I did not see any performance gains ( especially with -fprefetch-loop-arrays ) when I tried both options.
TimP
Black Belt

According to gcc docs, -fomit-frame-pointer is implied by -O, for cases where it is possible (-g would turn it off).  It seems it would be important mainly for 32-bit mode.

IIRC -fprefetch-loop-arrays was designed for AMD athlon-32 CPUs.  On any current CPU, it could be useful only for specialized cases, such as where the limit on hardware prefetched streams is exceeded, or DTLB misses can be mitigated without premature cache eviction.
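For reference, the effect -fprefetch-loop-arrays asks the compiler to produce automatically can be sketched by hand with GCC's __builtin_prefetch ( the function name and the prefetch distance of 16 elements here are illustrative guesses, not tuned values ):

```cpp
#include <cstddef>

// Hand-written software prefetch, roughly what -fprefetch-loop-arrays
// asks the compiler to emit. PREFETCH_DIST ( 16 floats = one 64-byte
// line ahead ) is an untuned, illustrative value. On CPUs with capable
// hardware prefetchers this usually gains nothing, matching the
// observation above. __builtin_prefetch is a GCC/Clang builtin.
void addWithPrefetch(float* a, const float* b, std::size_t n)
{
    constexpr std::size_t PREFETCH_DIST = 16;
    for (std::size_t i = 0; i < n; ++i)
    {
        if (i + PREFETCH_DIST < n)
        {
            __builtin_prefetch(&a[i + PREFETCH_DIST], 1, 0); // 1 = for write
            __builtin_prefetch(&b[i + PREFETCH_DIST], 0, 0); // 0 = for read
        }
        a[i] += b[i];
    }
}
```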

SergeyKostrov
Valued Contributor II

Thanks, Tim, for these comments.

>>According to gcc docs, -fomit-frame-pointer is implied by -O, for cases where it is possible (-g would turn it off).
>>It seems it would be important mainly for 32-bit mode.

I tested it with a test application compiled for the Release configuration; here is part of the compiler command line:

...-O2 -m32 -ffast-math -fomit-frame-pointer...

and I use the -g option for the Debug configuration only:

...-O0 -m32 -g...

>>IIRC -fprefetch-loop-arrays was designed for AMD athlon-32 CPUs...

That's good to know and I'll check the docs as well.
Bernard
Black Belt

>>...-fomit-frame-pointer...

Is that the option that frees the EBP register to hold arbitrary data, so that call stack frames are accessed through the ESP register instead?

SergeyKostrov
Valued Contributor II

[ Note 9 ] Here are results of tests on:

Dell Precision Mobile M4700
Intel Core i7-3840QM ( Ivy Bridge / 4 cores / 8 logical CPUs / ark.intel.com/compare/70846 )
Size of L3 Cache = 8MB ( shared between all cores for data & instructions )
Size of L2 Cache = 1MB ( 256KB per core / shared for data & instructions )
Size of L1 Cache = 256KB ( 32KB per core for data & 32KB per core for instructions )
Windows 7 Professional 64-bit

Two versions of the Classic matrix multiplication algorithm were tested:
- Sub-Test 4.2 - Transposed Technique ( MatMulV2 )
- Sub-Test 4.3 - Loop-Blocking Optimization Technique ( MatMulV3 )

Matrix Size: 2048x2048

Block Size                     MatMulV2 ( ticks )   MatMulV3 ( ticks )
   2 elements (    8 bytes )         1325                29781
   4 elements (   16 bytes )         1326                 9142
   8 elements (   32 bytes )         1339                 3978
  16 elements (   64 bytes )         1329                 2637
  32 elements (  128 bytes )         1336                 2585
  64 elements (  256 bytes )         1321                 2543
 128 elements (  512 bytes )         1323                 2465
 256 elements ( 1024 bytes )         1326                 2418
 512 elements ( 2048 bytes )         1326                 2371
1024 elements ( 4096 bytes )         1321                 2372
2048 elements ( 8192 bytes )         1326                 2356

1. Block sizes smaller than a cache line ( 64 bytes ) do not increase performance of the calculations.
2. It is important to note that the transpose-based version of the Classic matrix multiplication algorithm outperforms the loop-blocking-based version at every block size tested.
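A sketch of what a transpose-based classic matrix multiplication like MatMulV2 could look like ( flat row-major storage, names, and signature are my assumptions, not the original test code ):

```cpp
#include <cstddef>
#include <vector>

// Sketch of transpose-based classic matrix multiplication. B is
// transposed once up front so the inner dot product walks both operands
// sequentially in memory, which is why this variant competes so well
// with loop blocking. Matrices are square ( n x n ), row-major, flat.
void matMulTransposed(const std::vector<double>& a,
                      const std::vector<double>& b,
                      std::vector<double>& c, std::size_t n)
{
    std::vector<double> bt(n * n);              // B transposed
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            bt[j * n + i] = b[i * n + j];

    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
        {
            double sum = 0.0;
            for (std::size_t k = 0; k < n; ++k)
                sum += a[i * n + k] * bt[j * n + k];   // both sequential
            c[i * n + j] = sum;
        }
}
```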
SergeyKostrov
Valued Contributor II

[ Summary for Loop-Blocking Optimization Technique on Ivy Bridge ]

Matrix Size: 2048x2048

Block Size :    2 elements (    8 bytes ) - Completed in 29781 ticks
Block Size :    4 elements (   16 bytes ) - Completed in  9142 ticks
Block Size :    8 elements (   32 bytes ) - Completed in  3978 ticks
Block Size :   16 elements (   64 bytes ) - Completed in  2637 ticks
Block Size :   32 elements (  128 bytes ) - Completed in  2585 ticks
Block Size :   64 elements (  256 bytes ) - Completed in  2543 ticks
Block Size :  128 elements (  512 bytes ) - Completed in  2465 ticks
Block Size :  256 elements ( 1024 bytes ) - Completed in  2418 ticks
Block Size :  512 elements ( 2048 bytes ) - Completed in  2371 ticks
Block Size : 1024 elements ( 4096 bytes ) - Completed in  2372 ticks
Block Size : 2048 elements ( 8192 bytes ) - Completed in  2356 ticks
SergeyKostrov
Valued Contributor II

>>...Is that option used to load ebp register with arbitrary data?

This is what MSDN says about it:

...
This option speeds function calls, because no frame pointers need to be set up and removed. It also frees one more register ( EBP on the Intel 386 or later ) for storing frequently used variables and sub-expressions. /Oy is only available in x86 compilers.
...
Bernard
Black Belt

Yep, that's true, but FPO complicates debugging.

SergeyKostrov
Valued Contributor II

I don't understand the note about debugging, and it is not relevant to the subject of this thread.
Bernard
Black Belt

FPO = frame pointer omission.

Sorry for offtopic post.
