I recently learned about the ORDER input parameter of the RESHAPE intrinsic function, and happily started using it to rearrange the indices of multi-dimensional arrays, a job previously done with DO-loops.
Today I benchmarked RESHAPE's performance against simple DO-loops, and was surprised to find several drawbacks of RESHAPE that concern me:
- RESHAPE creates a temporary array, which caused a stack overflow when dealing with large arrays; I had to turn on -heap-arrays. The same does not happen with DO-loops or with a subroutine (instead of a function).
- Without compiler optimization, RESHAPE is slightly faster than DO-loops. With compiler optimization, RESHAPE is MUCH slower than DO-loops (especially with -heap-arrays on), like... more than FOUR times slower!
- This slowdown remains significant even when RESHAPE is coupled with a much more costly operation, such as matrix multiplication.
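For reference, the subroutine alternative I alluded to looks roughly like this: the result is written directly into a caller-supplied array, so no compiler-generated temporary is needed (the name permute_1324 and the hard-wired index order are illustrative only):

```fortran
! Sketch: permute indices (1,3,2,4) into a caller-supplied array.
! Writing into dst directly avoids the temporary that a RESHAPE
! function result requires. Subroutine name is illustrative only.
program demo
  implicit none
  integer, parameter :: d = 3
  double precision :: X(d,d,d,d), A(d,d,d,d)
  call random_number(X)
  call permute_1324(X, A, d)
  print *, A(2,3,1,1) == X(2,1,3,1)   ! spot check: prints T
contains
  subroutine permute_1324(src, dst, n)
    integer, intent(in) :: n
    double precision, intent(in)  :: src(n,n,n,n)
    double precision, intent(out) :: dst(n,n,n,n)
    integer :: x1, x2, x3, x4
    do x4 = 1, n
      do x3 = 1, n
        do x2 = 1, n
          do x1 = 1, n
            dst(x1,x2,x3,x4) = src(x1,x3,x2,x4)
          end do
        end do
      end do
    end do
  end subroutine permute_1324
end program demo
```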
In particular, I compared the performance of the following procedure (pseudocode)
A = RESHAPE(X); B = RESHAPE(Y); CALL GEMM(A,B,C); Z = RESHAPE(C)
and its DO-loops equivalent.
While GEMM takes up most of the running time, using RESHAPE instead of DO-loops still gives me a 40% slowdown(!!!), which I consider non-negligible.
So my questions are:
- Is the creation of a temporary array expected? Is there any way to avoid this temporary array, or is it an inherent problem with functions (as opposed to subroutines)?
- Is there any way for RESHAPE to benefit from compiler optimization the way DO-loops do?
- Are there any more optimized procedures (external libraries welcome) that can do what RESHAPE does (including the "order" feature)? Similar to how ?gemm in BLAS (MKL) is a more optimized alternative to MATMUL.
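To frame the last question: a plain DO-loop stand-in for RESHAPE's ORDER feature (rank 4, equal extents for brevity) would look roughly like the sketch below. It mirrors the ORDER semantics as I understand them: source subscript k of each element becomes destination subscript ord(k). The helper name permute4 is hypothetical, not a library routine.

```fortran
! Sketch: a DO-loop replacement for RESHAPE with a runtime ORDER
! vector (rank 4, equal extents). permute4 is a hypothetical helper.
program demo
  implicit none
  integer, parameter :: d = 3
  double precision :: X(d,d,d,d), A(d,d,d,d)
  call random_number(X)
  call permute4(X, A, d, (/1,3,2,4/))
  ! same effect as A = reshape(X, shape(A), order=(/1,3,2,4/)),
  ! i.e. A(x1,x2,x3,x4) = X(x1,x3,x2,x4); spot check prints T:
  print *, A(1,3,2,1) == X(1,2,3,1)
contains
  subroutine permute4(src, dst, n, ord)
    integer, intent(in) :: n, ord(4)
    double precision, intent(in)  :: src(n,n,n,n)
    double precision, intent(out) :: dst(n,n,n,n)
    integer :: i1, i2, i3, i4, k, i(4), p(4)
    do i4 = 1, n
      do i3 = 1, n
        do i2 = 1, n
          do i1 = 1, n
            i = (/ i1, i2, i3, i4 /)
            do k = 1, 4
              p(ord(k)) = i(k)   ! source subscript k lands in slot ord(k)
            end do
            dst(p(1), p(2), p(3), p(4)) = src(i1, i2, i3, i4)
          end do
        end do
      end do
    end do
  end subroutine permute4
end program demo
```

(As far as I know, MKL's mkl_?omatcopy covers the 2-D transpose case, but I am not aware of an optimized library routine for general rank-4 permutations.)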
Thanks!
The following is the core part of my benchmark code:
! permute indices using RESHAPE and compute matrix product
start = dsecnd()
A = reshape( X, shape = shape(A), order = (/1,3,2,4/) )
B = reshape( Y, shape = shape(A), order = (/3,2,1,4/) )
call dgemm('n', 'n', d*d, d*d, d*d, 1.0d0, A, d*d, B, d*d, 0.0d0, C, d*d)
Z1 = reshape( C, shape = shape(Z1), order = (/2,4,1,3/) )
finish = dsecnd()
time(1) = time(1) + (finish - start)

! permute indices using DO-loops and compute matrix product
start = dsecnd()
do x4 = 1, d
  do x3 = 1, d
    do x2 = 1, d
      do x1 = 1, d
        A(x1,x2,x3,x4) = X(x1,x3,x2,x4)
        B(x1,x2,x3,x4) = Y(x3,x2,x1,x4)
      enddo
    enddo
  enddo
enddo
call dgemm('n', 'n', d*d, d*d, d*d, 1.0d0, A, d*d, B, d*d, 0.0d0, C, d*d)
do x4 = 1, d
  do x3 = 1, d
    do x2 = 1, d
      do x1 = 1, d
        Z2(x1,x2,x3,x4) = C(x2,x4,x1,x3)
      enddo
    enddo
  enddo
enddo
finish = dsecnd()
time(2) = time(2) + (finish - start)
Timing results:
 d   d^2    operation repeated   total time using     total time using
            (times)              RESHAPE (sec)        DO-loop (sec)
 8    64    100                  0.0442               0.0078
 9    81    100                  0.0382               0.0148
10   100    100                  0.0561               0.0220
11   121    100                  0.0867               0.0385
12   144    100                  0.0837               0.0358
13   169    100                  0.1215               0.0562
14   196    100                  0.1393               0.0650
15   225    100                  0.1468               0.0808
16   256    100                  0.2121               0.1005
17   289    100                  0.2787               0.1694
18   324    100                  0.3292               0.1904
19   361    100                  0.4361               0.2869
20   400    100                  0.5803               0.3663
21   441    100                  0.7818               0.5412
22   484    100                  1.0171               0.6280
23   529    100                  1.4412               0.9693
24   576    100                  2.0233               1.3788
25   625    100                  1.8849               1.2557
26   676    100                  2.9092               2.0290
27   729    100                  3.0685               2.0632
28   784    100                  4.4147               3.1926
29   841    100                  5.8867               4.4470
30   900    100                  5.2322               3.3947
31   961    100                  7.4737               5.2590
32  1024    100                  9.2567               5.8486
33  1089    100                  8.9164               5.9614
34  1156     83                 10.0240               7.0497
35  1225     63                 10.1190               6.6005
36  1296     55                 10.2040               7.2455
Why are you not organizing your arrays in an optimal order in which they can be used? IOW reduce the use of RESHAPE to that minimally required.
There is absolutely no practical reason to maintain left-to-right array index order in source code between, say, C and Fortran. Keep the data blob in the order that is most effective for the calculations (SIMD friendly). If one side uses index order X, Y, Z and the other side uses index order Z, Y, X, and can do so without performing a reshaping copy, then "the source code looks similar" is not a good enough justification for using RESHAPE. The C/C++ side typically does not use hardwired multi-dimensional arrays; rather, it typically uses an array of pointers. It is more effective to perform the "reshape" on the C/C++ side by constructing the obverse array of pointers. This exchanges the overhead of creating what is essentially a duplicate array, with size(array) element copies, for constructing a pointer array of only size(array)/size(row) entries. And, done correctly, it most likely eliminates constructing the pointer array on second and subsequent calls.
Jim Dempsey
jimdempseyatthecove wrote:
Why are you not organizing your arrays in an optimal order in which they can be used? IOW reduce the use of RESHAPE to that minimally required. [...]
The code I posted is for testing only, and it purposely arranges the indices in a non-optimal way.
For my actual code, which performs a large number of consecutive tensor contractions, rearranging indices is unavoidable if I want to use ?GEMM in BLAS.
My question has nothing to do with C/C++.
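To illustrate why the rearrangement is unavoidable, here is a toy contraction where the contracted indices are not in matching order, so one operand must be permuted before the product can be done as a single matrix multiply. This is a sketch, not my production code, and MATMUL stands in for dgemm to keep it self-contained:

```fortran
! Toy contraction: Z(a,e) = sum over b,c of X(a,b,c) * Y(c,b,e).
! Y's contracted indices (c,b) are in the wrong order, so Y must be
! permuted to Yp(b,c,e) before one matrix product finishes the job.
program contraction_demo
  implicit none
  integer, parameter :: d = 4
  double precision :: X(d,d,d), Y(d,d,d), Yp(d,d,d)
  double precision :: Z(d,d), Zref(d,d)
  integer :: a, b, c, e
  call random_number(X)
  call random_number(Y)
  ! reference result by direct summation
  Zref = 0.0d0
  do e = 1, d
    do c = 1, d
      do b = 1, d
        do a = 1, d
          Zref(a,e) = Zref(a,e) + X(a,b,c) * Y(c,b,e)
        end do
      end do
    end do
  end do
  ! GEMM route: permute Y so the contracted indices lead, in X's order
  do e = 1, d
    do c = 1, d
      do b = 1, d
        Yp(b,c,e) = Y(c,b,e)
      end do
    end do
  end do
  Z = matmul(reshape(X, (/ d, d*d /)), reshape(Yp, (/ d*d, d /)))
  print *, maxval(abs(Z - Zref)) < 1.0d-12   ! prints T
end program contraction_demo
```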
The cost of RESHAPE, in the way you are using it, is approximately that of a TRANSPOSE (it can be lesser or greater depending on the index orders). Used as outlined in your example program, this cost is incurred twice. Therefore, it is only worthwhile where the computation time of the GEMM or BLAS routine called is relatively high (e.g. matrix multiply). For the cheaper routines, you will be better off writing your own (or locating routines that accept your own index layout).
>>my actual code, which deals with large number of consecutive tensor contractions
Here is an alternate approach for you to consider:
The RESHAPE is memory-bandwidth bound; (hopefully) the GEMM or BLAS call is floating-point bound. Seeing that you have a large number of these tensor contractions, it might be advantageous to use two threads (preferably on the same core) as a team in a form of pipeline (or double buffer, if you prefer), where one thread performs the RESHAPEs into the pipeline (double buffer) and the other thread extracts references to the reshaped arrays and performs the calls to GEMM or BLAS. In this manner the memory latencies of the RESHAPE are hidden behind the computation of the GEMM or BLAS call. Depending on the tensor size, the RESHAPE may or may not need to use non-temporal stores and/or run on a different core from the compute thread. This is a relatively easy tuning issue.
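A minimal sketch of the double-buffer idea, using OpenMP sections: while the product for step k runs, the reshape for step k+1 proceeds on another thread. MATMUL stands in for the dgemm call, and the input stream, buffer count, and synchronization (the implicit barrier at the end of the sections construct) are simplified assumptions, not a tuned implementation:

```fortran
! Double-buffer sketch: reshape for step k+1 overlaps compute for
! step k. Compiles and runs serially too (directives are comments).
program pipeline
  implicit none
  integer, parameter :: d = 4, nmat = 3
  double precision :: X(d,d,d,d,nmat)      ! illustrative input stream
  double precision :: buf(d*d, d*d, 2)     ! double buffer
  double precision :: C(d*d, d*d)
  integer :: k, cur, nxt
  call random_number(X)
  buf(:,:,1) = reshape(X(:,:,:,:,1), (/ d*d, d*d /))   ! prime pipeline
  do k = 1, nmat
    cur = mod(k-1, 2) + 1
    nxt = mod(k,   2) + 1
    !$omp parallel sections
    !$omp section
    ! producer: fill the other buffer while the consumer computes
    if (k < nmat) buf(:,:,nxt) = reshape(X(:,:,:,:,k+1), (/ d*d, d*d /))
    !$omp section
    ! consumer: MATMUL stands in for the dgemm call
    C = matmul(buf(:,:,cur), buf(:,:,cur))
    !$omp end parallel sections
  end do
  print *, maxval(C) > 0.0d0   ! prints T
end program pipeline
```

The two sections touch disjoint buffer slices (nxt vs cur), so the overlap is race-free; the implicit barrier at the end of the sections construct keeps the steps in order.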
Jim Dempsey