Solved: Slow rectangular matrix transposition?

Baptiste_W_ · ‎04-21-2015

Hello,

I'm working with MKL 11.2.0.090 on Gentoo, on an "Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz" processor. I'm trying to speed up my in-place matrix transpositions, and I thought mkl_?imatcopy would be the solution. I get a very good speedup on square matrices, but on rectangular matrices it is much slower than my naive "follow the cycles" implementation.

Here is the call:

mkl_dimatcopy('R', 'T', rows, cols, 1.0, matrix_ptr, rows, cols);

When I profiled the executable, most of the cycles were spent in libmkl_avx.so [.] mkl_trans_avx_mkl_dimatcopy_mipt_t

Am I doing something wrong, or is the algorithm simply not good on rectangular matrices (I'd be surprised)? Should I just write an algorithm that uses O(MN) extra space?

Thanks
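For readers unfamiliar with the "follow the cycles" approach mentioned in the question, here is a minimal sketch in C. It is not the poster's actual code, which was not shown. It assumes a row-major rows x cols matrix of doubles and uses no extra memory beyond a few scalars. The names transposed_index and transpose_follow_cycles are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Row-major index that element i of a rows x cols matrix occupies
   after transposition (the result is cols x rows, row-major). */
static size_t transposed_index(size_t i, size_t rows, size_t cols)
{
    return (i % cols) * rows + i / cols;
}

/* In-place transpose by following the cycles of the index permutation.
   Hypothetical helper: O(1) extra memory, but the access pattern is
   scattered and each cycle is walked more than once. */
void transpose_follow_cycles(double *a, size_t rows, size_t cols)
{
    size_t n = rows * cols;
    assert(a != NULL || n == 0);

    for (size_t start = 0; start < n; ++start) {
        /* Handle each cycle exactly once, from its smallest index. */
        size_t j = transposed_index(start, rows, cols);
        while (j > start)
            j = transposed_index(j, rows, cols);
        if (j < start)
            continue; /* already rotated from a smaller index */

        /* Rotate the cycle, carrying the displaced value forward. */
        double carried = a[start];
        j = start;
        do {
            j = transposed_index(j, rows, cols);
            double displaced = a[j];
            a[j] = carried;
            carried = displaced;
        } while (j != start);
    }
}
```

Calling transpose_follow_cycles(m, 2, 3) on the six values 1..6 rearranges them so that they read as the 3 x 2 transpose. The cost of avoiding a buffer is visible in the code: every starting index walks part of its cycle just to decide whether that cycle was already processed.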

Evarist_F_Intel · ‎04-22-2015

Hi,

You are right, mkl_?imatcopy is not optimized for non-square cases, since even an optimized version would be much slower than out-of-place transposition. In such situations we usually either use mkl_?omatcopy or, if it suits the algorithm, a gather-operation-scatter technique: copy a block of data to a temporary buffer, perform the needed operations, and scatter the data back to its place. This technique lets the data be reused in cache and generally improves performance.

The square case is well optimized, since that is the case where mkl_?imatcopy can really help.
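As an illustration of the out-of-place route, here is a minimal sketch in plain C that transposes into a temporary buffer tile by tile and then copies the result back. It is a stand-in for what mkl_?omatcopy would do and does not call MKL. The function names and the tile size BLOCK are assumptions chosen for the example, and the matrix is assumed to be row-major.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

enum { BLOCK = 32 }; /* assumed tile edge; tune for the target cache */

/* Out-of-place transpose: src is rows x cols, dst is cols x rows,
   both row-major. Working in tiles keeps reads and writes cache-local. */
void transpose_out_of_place(const double *src, double *dst,
                            size_t rows, size_t cols)
{
    assert(src != dst); /* the buffers must not alias */

    for (size_t ib = 0; ib < rows; ib += BLOCK) {
        size_t i_end = ib + BLOCK < rows ? ib + BLOCK : rows;
        for (size_t jb = 0; jb < cols; jb += BLOCK) {
            size_t j_end = jb + BLOCK < cols ? jb + BLOCK : cols;
            for (size_t i = ib; i < i_end; ++i)
                for (size_t j = jb; j < j_end; ++j)
                    dst[j * rows + i] = src[i * cols + j];
        }
    }
}

/* Transpose `a` through a temporary buffer of rows * cols doubles.
   Returns 0 on success, -1 if the buffer cannot be allocated. */
int transpose_via_buffer(double *a, size_t rows, size_t cols)
{
    size_t n = rows * cols;
    if (n == 0)
        return 0;

    double *tmp = malloc(n * sizeof *tmp);
    if (tmp == NULL)
        return -1;

    transpose_out_of_place(a, tmp, rows, cols);
    memcpy(a, tmp, n * sizeof *tmp);
    free(tmp);
    return 0;
}
```

This uses O(MN) extra space, which is the trade-off raised in the question. If the caller can simply keep working with the destination buffer, the final memcpy can be dropped.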

Baptiste_W_ · ‎04-22-2015

Hi,

Thanks for the quick answer :)

I will switch to copy/omatcopy for now. imatcopy has really impressive performance for square matrices.

Even if not fully optimized for the rectangular case, I would have expected better performance than my naive algorithm.

Slow rectangular matrix transposition ?