Shaquille_W_1
Beginner
281 Views

How to optimize "Matrix Transposition" in x64 or x86 platform

I'm having difficulty optimizing matrix transposition.

The details are as follows:

I want to rotate a grayscale (8-bit) image by 90, 180, or 270 degrees.

The 90- and 270-degree cases are the hard ones: they require exchanging rows and columns, and the loop kernel can only move 1 byte at a time, so it is very difficult to obtain high performance.

I would therefore like some help from Intel's experts. Is there anyone who could help me?

Thank you very much

3 Replies
McCalpinJohn
Black Belt

Matrix transposition can be tricky, but the tricks are quite different for different matrix sizes.   For very small matrices that fit in L1 cache, it is important to vectorize the code.   Vectorization of small (e.g., 8x8) transpose operations is discussed in an Intel AVX white paper and in the Intel Software Optimization reference guide.  The code there discusses 32-bit and 64-bit data if I recall correctly, but many of the same principles apply to 8-bit or 16-bit data.  For these smaller data types the AVX2 instruction set support in the Haswell (and later) processors should be helpful, though optimization at this level is fairly hard work.  The general principle is that you need to balance the use of the "shuffle" instructions on Port 5 with re-loading the data at different offsets (and exploiting the limited data rearrangement capabilities available in Port 0 and Port 1 instructions).  For AVX/AVX2 there are 32 8-bit values in a 256-bit SIMD register, so there will not be enough registers to do a full 32x32 transpose in registers.  Lots of data will have to be loaded multiple times.  It is pretty much essential to review all of the available instructions to see what capabilities the processor has for the data type of interest.

For large matrices (i.e., bigger than L2 cache), the key is to avoid cache associativity conflicts.  Since the L1 and L2 caches are 8-way set-associative, it is very helpful to do blocking or tiling or unrolling to a degree of no more than 8.   If I recall correctly, I had pretty good results by simply adding a "#pragma unroll(8)" directive on the *outer* loop of a simple transposition code.  

I don't have my test results handy, but you may find that there are significant differences in performance between an implementation with contiguous loads and an implementation with contiguous stores.  You will want to test both.

For very large matrices (much bigger than L3), you will want to do sub-blocking for L3 and TLB re-use, and you may want to try using 2 MiB pages.

Shaquille_W_1
Beginner

Hello, John D. McCalpin,

First, thanks for your reply and suggestions.

Now I have some questions about your suggestions.

1. As you mentioned:

   "the general principle is that you need to balance the use of the "shuffle" instructions on Port 5 with re-loading the data at different offsets (and exploiting the limited data rearrangement capabilities available in Port 0 and Port 1 instructions)"

    What are Port 5, Port 0, and Port 1?

2. As you mentioned:

    "Since the L1 and L2 caches are 8-way set-associative, it is very helpful to do blocking or tiling or unrolling to a degree of no more than 8"

    What are "blocking" and "tiling"?

3. As you mentioned:

    "For very large matrices (much bigger than L3), you will want to do sub-blocking for L3 and TLB re-use, and you may want to try using 2 MiB pages."

     What are "sub-blocking" and "TLB re-use", and how can I use 2 MiB pages?

      Would you give me some further details? Thank you very much.

McCalpinJohn
Black Belt

  1. To learn about the execution ports on Intel processors, the primary reference is Chapter 2 of the "Intel 64 and IA-32 Architectures Optimization Reference Manual" (Intel document 248966, revision 030, September 2014).   
  2. The issues related to AVX optimization are in Chapter 11 of the same manual, including Section 11.11 "Handling Port 5 Pressure".
  3. Wikipedia has an introduction to Loop Tiling https://en.wikipedia.org/wiki/Loop_tiling
  4. "Sub-blocking" is a generalization of multidimensional loop unrolling (https://en.wikipedia.org/wiki/Loop_unrolling), converting a single loop into a double loop nest.  When applied to loops that are already nested, it results in operations being performed in arbitrarily-sized multidimensional blocks.
  5. One of the reasons that sub-blocking is desirable is that it improves re-use of the TLB.  The TLB is described at https://en.wikipedia.org/wiki/Translation_lookaside_buffer.
  6. The "page size" refers to the size of the block of memory that is mapped by a single page translation.  You can read more at https://en.wikipedia.org/wiki/Page_%28computer_memory%29.

Understanding these issues is a multi-year undertaking.   By the time you understand the issues in relation to current systems, those systems will have become obsolete and will have been replaced by much more complex systems.  Given the amount of effort required, you would almost certainly be better off looking up the "transpose" functions in the Intel Performance Primitives library (https://software.intel.com/en-us/intel-ipp).
