Intel® Fortran Compiler

RESHAPE performance issues

Chenfeng_B_
Beginner
2,250 Views

I recently learned about the ORDER argument of the RESHAPE intrinsic function, and happily started using it for index rearrangements of multi-dimensional arrays that I had previously written as DO-loops.

Today, I benchmarked RESHAPE's performance against simple DO-loops, and was surprised to find several drawbacks of RESHAPE that concern me.

  1. RESHAPE creates a temporary array, which causes a stack overflow when dealing with large arrays. I had to turn on -heap-arrays. The same does not happen with DO-loops or with a subroutine (instead of a function).
  2. Without compiler optimization, RESHAPE is slightly faster than DO-loops. With compiler optimization, RESHAPE is MUCH slower than DO-loops (especially with -heap-arrays on), like... more than FOUR times slower!
  3. The slowdown is significant even when RESHAPE is coupled with a much more costly operation, such as matrix multiplication.

In particular, I compared the performance of the following procedure (pseudocode)
A = RESHAPE(X); B = RESHAPE(Y); CALL GEMM(A,B,C); Z = RESHAPE(C)
and its DO-loops equivalent.
While GEMM takes up most of the running time, using RESHAPE instead of DO-loops still gives me a 40% slowdown(!!!), which I consider non-negligible.

So my questions are

  1. Is the creation of a temporary array expected? Is there any way to avoid it, or is it an inherent problem with functions (as opposed to subroutines)?
  2. Is there any way for RESHAPE to benefit from compiler optimization the way DO-loops do?
  3. Is there a more optimized procedure (external libraries welcome) that can do what RESHAPE does (including the ORDER feature)? Similar to how ?gemm in BLAS (MKL) is a more optimized alternative to MATMUL.

Thanks!

The following is the core part of my benchmark code:

! permute indices using RESHAPE and compute matrix product
start = dsecnd()
A  = reshape( X, shape = shape(A),  order = (/1,3,2,4/))
B  = reshape( Y, shape = shape(B),  order = (/3,2,1,4/))
call dgemm('n', 'n', d*d,d*d, d*d, 1.0d0, A, d*d, B, d*d, 0.0d0, C, d*d)
Z1 = reshape( C, shape = shape(Z1), order = (/2,4,1,3/))
finish = dsecnd()
time(1) = time(1) + (finish - start)

! permute indices using DO-loop and compute matrix product
start = dsecnd()
do x4 = 1, d
  do x3 = 1, d
    do x2 = 1, d
      do x1 = 1, d
        A(x1,x2,x3,x4) = X(x1,x3,x2,x4)
        B(x1,x2,x3,x4) = Y(x3,x2,x1,x4)
      end do
    end do
  end do
end do
call dgemm('n', 'n', d*d,d*d, d*d, 1.0d0, A, d*d, B, d*d, 0.0d0, C, d*d)
do x4 = 1, d
  do x3 = 1, d
    do x2 = 1, d
      do x1 = 1, d
        Z2(x1,x2,x3,x4) = C(x2,x4,x1,x3)
      end do
    end do
  end do
end do
finish = dsecnd()
time(2) = time(2) + (finish - start)

Timing results:

d     d^2      operation            total time            total time
             repeated (times)    using RESHAPE (sec)    using DO-loop (sec)
8     64         100                 0.0442                 0.0078
9     81         100                 0.0382                 0.0148
10    100        100                 0.0561                 0.0220
11    121        100                 0.0867                 0.0385
12    144        100                 0.0837                 0.0358
13    169        100                 0.1215                 0.0562
14    196        100                 0.1393                 0.0650
15    225        100                 0.1468                 0.0808
16    256        100                 0.2121                 0.1005
17    289        100                 0.2787                 0.1694
18    324        100                 0.3292                 0.1904
19    361        100                 0.4361                 0.2869
20    400        100                 0.5803                 0.3663
21    441        100                 0.7818                 0.5412
22    484        100                 1.0171                 0.6280
23    529        100                 1.4412                 0.9693
24    576        100                 2.0233                 1.3788
25    625        100                 1.8849                 1.2557
26    676        100                 2.9092                 2.0290
27    729        100                 3.0685                 2.0632
28    784        100                 4.4147                 3.1926
29    841        100                 5.8867                 4.4470
30    900        100                 5.2322                 3.3947
31    961        100                 7.4737                 5.2590
32    1024       100                 9.2567                 5.8486
33    1089       100                 8.9164                 5.9614
34    1156       83                 10.0240                 7.0497
35    1225       63                 10.1190                 6.6005
36    1296       55                 10.2040                 7.2455
3 Replies
jimdempseyatthecove
Honored Contributor III

Why are you not organizing your arrays in the optimal order in which they will be used? In other words, reduce the use of RESHAPE to the minimum required.

There is absolutely no practical reason to maintain the same left-to-right array index order in source code between, say, C and Fortran. Keep the data blob in the order that is most effective for the calculations (SIMD friendly). If one side uses index order X, Y, Z, the other side uses index order Z, Y, X, and both can work without performing a reshaping copy, then "the source code looks similar" is not a good enough justification for RESHAPE. The C/C++ side typically does not use hardwired multi-dimensional arrays; rather, it typically uses an array of pointers. It is more effective to perform the "reshape" on the C/C++ side by constructing the transposed array of pointers. This exchanges the overhead of creating what is essentially a duplicate array, copying all of its cells (size(array) copies), for constructing a pointer array of size(array)/size(row). Done correctly, this most likely also eliminates rebuilding the pointer array on second and subsequent calls.
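A related way to trim the number of copies, at least when the permutation needed on a 2-D operand is a plain transpose: DGEMM can absorb the transpose through its TRANSA/TRANSB arguments, so no reshaped copy has to be materialized. A minimal sketch (the dimensions m, n, k and the array names are illustrative, not from the original post):

```fortran
! Sketch: compute C := A**T * B without ever forming A**T explicitly.
! A is stored k-by-m, B is k-by-n, C is m-by-n.
! Passing 't' as the first argument makes DGEMM read A transposed,
! avoiding one RESHAPE/TRANSPOSE-style copy entirely.
call dgemm('t', 'n', m, n, k, 1.0d0, A, k, B, k, 0.0d0, C, m)
```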

Jim Dempsey

Chenfeng_B_
Beginner

jimdempseyatthecove wrote:

Why are you not organizing your arrays in the optimal order in which they will be used? In other words, reduce the use of RESHAPE to the minimum required.

There is absolutely no practical reason to maintain the same left-to-right array index order in source code between, say, C and Fortran. Keep the data blob in the order that is most effective for the calculations (SIMD friendly). If one side uses index order X, Y, Z, the other side uses index order Z, Y, X, and both can work without performing a reshaping copy, then "the source code looks similar" is not a good enough justification for RESHAPE. The C/C++ side typically does not use hardwired multi-dimensional arrays; rather, it typically uses an array of pointers. It is more effective to perform the "reshape" on the C/C++ side by constructing the transposed array of pointers. This exchanges the overhead of creating what is essentially a duplicate array, copying all of its cells (size(array) copies), for constructing a pointer array of size(array)/size(row). Done correctly, this most likely also eliminates rebuilding the pointer array on second and subsequent calls.

Jim Dempsey

The code I posted is for testing only; I purposefully arranged the indices in a non-optimal way.

For my actual code, which deals with a large number of consecutive tensor contractions, rearranging indices is unavoidable if I want to use ?GEMM in BLAS.

My question has nothing to do with C/C++.

jimdempseyatthecove
Honored Contributor III

The cost of RESHAPE, in the way you are using it, is approximately equivalent to the cost of a TRANSPOSE (though it can be lesser or greater depending on the index orders). Used as outlined in your example program, this cost is incurred twice. Therefore, its use is only suitable where the computation time of the GEMM or BLAS routine called is relatively high (e.g. matrix multiply). For the less computational routines, you will be better off writing your own (or locating routines for your own index layout).
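For the "write your own" route, one variant worth noting: performing the permutation in a subroutine that writes directly into an INTENT(OUT) argument avoids the temporary that a RESHAPE function result requires (the original post's first complaint). A minimal sketch for one of the permutations in the benchmark, the (/1,3,2,4/) case; the subroutine name is illustrative:

```fortran
! Sketch: permute dimensions 2 and 3 of a rank-4 array.
! Writing into dst directly avoids the function-result temporary,
! and the innermost (implied) loop runs over the contiguous first
! dimension, which the compiler can vectorize.
subroutine permute_1324(src, dst, d)
  implicit none
  integer, intent(in) :: d
  double precision, intent(in)  :: src(d,d,d,d)
  double precision, intent(out) :: dst(d,d,d,d)
  integer :: x2, x3, x4
  do x4 = 1, d
    do x3 = 1, d
      do x2 = 1, d
        dst(:,x2,x3,x4) = src(:,x3,x2,x4)
      end do
    end do
  end do
end subroutine permute_1324
```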

>>my actual code, which deals with large number of consecutive tensor contractions

Here is an alternate approach for you to consider:

The RESHAPE is memory-bandwidth bound; (hopefully) the GEMM or BLAS call is floating-point bound. Seeing that you have a large number of these tensor contractions, it might be advantageous to use two threads (preferably on the same core) as a team in a form of pipeline (or double buffer, if you prefer), where one thread performs the RESHAPEs into the pipeline (double buffer) and the other thread extracts references to the reshaped arrays and performs the calls to GEMM or BLAS. In this manner the memory latencies of the RESHAPE are overlapped with the computation of the GEMM or BLAS call. Depending on the tensor size, the RESHAPE may or may not need to use non-temporal stores and/or run on a different core from the compute thread. This is a relatively easy tuning issue.
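One way to sketch this pipeline is with OpenMP tasks and a two-slot double buffer; the slot(:) elements serve only as dependence tokens, and all names here are illustrative rather than taken from the original code:

```fortran
! Sketch of the producer/consumer pipeline: for each contraction k,
! a "reshape" task fills buffer slot b, then a "compute" task calls
! DGEMM on that slot. The depend clauses order producer -> consumer
! for the same slot, and make the producer of contraction k+2 wait
! until the consumer of contraction k has released the slot
! (classic double buffering).
integer :: slot(0:1), k, b
!$omp parallel
!$omp single
do k = 1, ncontractions
  b = mod(k, 2)
  !$omp task depend(out: slot(b)) firstprivate(k, b)
    ! memory-bound half: permute input k into buffer slot b via DO-loops
  !$omp end task
  !$omp task depend(in: slot(b)) firstprivate(k, b)
    ! compute-bound half: call dgemm on buffer slot b, store result k
  !$omp end task
end do
!$omp end single
!$omp end parallel
```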

Jim Dempsey
