Matrix Transpose for char/short array

gautam_himanshu · ‎03-02-2012

We can find the macro __MM_TRANSPOSE_PS for transpose of floats. But I am interested in doing transpose of an array of characters. I was able to write _MM_TRANSPOSE_PS myself using unpack and move intrinsics, but can't find similar intrinsics for chars.

Can anyone please help as to what approach should be taken in this situation.

Thanks
HG

sirrida · ‎03-02-2012

Depending on the array size PSHUFB can be a solution or at least be helpful doing it.
Please give an example.

jimdempseyatthecove · ‎03-02-2012

Show us your declaration for the character array.
This will give us an idea of the size of the array and the number of dimensions.

Jim Dempsey

SergeyKostrov · ‎03-02-2012

An '_MM_TRANSPOSE4_PS' macro declared in 'xmmintrin.h' is designed to use a 4x4 matrix of floats
( single-precision ) as input.

In case of a similar approach for a 'char' type the biggest dimension will be 16x16, and for 'short' type it will be 8x8.

I'd like to repeatthe same question:

How big are your 'char' / 'short' matricies?

Best regards,
Sergey

SergeyKostrov · ‎03-03-2012

Quoting gautam.himanshu

We can find the macro __MM_TRANSPOSE_PS for transpose of floats...

Does it make sense to use a SSE based transpose for a 4x4 matrix instead of a Classic algorithm?

Please take a look atresults of a test:

DEBUG configuration

> Test1028 Start <
Sub-Test 5 - 200,000,000 calls to [ CLASSIC 4x4 Matrix Transpose ] - 19657 ticks
Sub-Test 6 - 200,000,000 calls to [ SSE 4x4 Matrix Transpose ] - 8640 ticks // 2.28x faster
> Test1028 End <

RELEASE configuration

> Test1028 Start <
Sub-Test 5 - 200,000,000 calls to [ CLASSIC 4x4 Matrix Transpose ] - 18563 ticks
Sub-Test 6 - 200,000,000 calls to [ SSE 4x4 Matrix Transpose ] - 5843 ticks // 3.18x faster
> Test1028 End <

gautam_himanshu · ‎03-04-2012

I am working on some benchmarks and generally taking sizes like 1k x 1k. shuffling the xmm registers seem the only posssible way which i dont think will give some good gains.

styc · ‎03-04-2012

How about using IPP? http://software.intel.com/sites/products/documentation/hpc/ipp/ippi/ippi_ch4/functn_Transpose.html

SergeyKostrov · ‎03-05-2012

Quoting gautam.himanshu

I am working on some benchmarks and generally taking sizes like 1k x 1k. shuffling the xmm registers seem the only posssible way which i dont think will give some good gains.

I couldprovide you with the performance numbers for two Matrix Transpose algorithms, applied to a
1K x 1K matrix,that I've implemented for my current project. That is,

- a Classic ( Two-For-Loops /Non-Inplace)

and

- a Diagonal Based( Two-For-Loops / Inplace )

The Diagonal Based algorithm doesn't need a second outputmatrix andhas areduced number of
exchanges. It never "touches" values along the diagonal line from left-top corner to right-bottom corner of the matrix.

SergeyKostrov · ‎03-05-2012

Please take a look at performance results.

Matrix size: 1,024 x 1,024

Classic Transpose - ( 128 transposes in 10.015 sec ) = 0.0782421875 sec
Diagonal Transpose - (128 transposes in 5.609 sec) = 0.0438203125 sec => ~1.79x faster

SergeyKostrov · ‎03-05-2012

Please take a look at results of another test.

If four __m128 variables:

...
__m128 row1 = { 0x0 };
__m128 row2 = { 0x0 };
__m128 row3 = { 0x0 };
__m128 row4 = { 0x0 };
...

initialized with characters as follows:

...
row1.m128_u8[ 0] = '0'; r1.m128_u8[ 1] = '1'; r1.m128_u8[ 2] = '2'; r1.m128_u8[ 3] = '3';
row1.m128_u8[ 4] = '4'; r1.m128_u8[ 5] = '5'; r1.m128_u8[ 6] = '6'; r1.m128_u8[ 7] = '7';
row1.m128_u8[ 8] = '8'; r1.m128_u8[ 9] = '9'; r1.m128_u8[10] = 'A'; r1.m128_u8[11] = 'B';
row1.m128_u8[12] = 'C'; r1.m128_u8[13] = 'D'; r1.m128_u8[14] = 'E'; r1.m128_u8[15] = 'F';
...
< the same for rows row2, row3 and row4 >
...

a Source Matrix ( as characters ) will look like:

0123456789ABCDEF
0123456789ABCDEF
0123456789ABCDEF
0123456789ABCDEF

and after a call to:

...
_MM_TRANSPOSE4_PS( row1, row2, row3, row4 );
...

a Transposed Matrix will look like:

0123012301230123
4567456745674567
89AB89AB89AB89AB
CDEFCDEFCDEFCDEF

This is wrong and there is nothing unusual here. The _MM_TRANSPOSE4_PS macro cannot be used for
transposing a 4x16 matrix of characters because it was designed to transpose a 4x4 matrix of floats.

SergeyKostrov · ‎03-06-2012

Quoting gautam.himanshu

I am working on some benchmarks and generally taking sizes like 1k x 1k. shuffling the xmm registers seem the only posssible way which i dont think will give some good gains.

It would be interesting to see results of your R&D. Please provide some technical details and performance
numbersif you can.

Did you consideran Eklundh method of aMatrix Transpose?

Best regards,
Sergey

gautam_himanshu · ‎03-09-2012

Thanks for your interest. i finally managed to do a good transpose using unpackepi8/16/32/64 instructions. its hard to give any numbers as transpose was a part of the actial problem. anyways i am intetested in eklundh method. what kind of numbets are reachable there?

SergeyKostrov · ‎03-09-2012

Quoting gautam.himanshu

Thanks for your interest. i finally managed to do a good transpose using unpackepi8/16/32/64 instructions. its hard to give any numbers as transpose was a part of the actial problem...

It would nice to see a performance comparison of your SSE based algorithmwith a Classic algorithm.

The Eklundh method for a matrix transpose makes moreiterationsandmoreexchangescompared to a
Diagonal based algorithm.

SergeyKostrov · ‎03-11-2012

Quoting gautam.himanshu

...anyways i am intetested in eklundh method. what kind of numbets are reachable there?...

Here is a comparisonof number of exchangesfordifferent algorithms. In case of an 8x8 matrix:

Classic-64 exchanges
Diagonal -28 exchanges
Eklundh - 48 exchanges

Take into account that forDiagonal and Eklundh algorithms an input matrix must be Square and both
algorithms areInplace (don't need an output matrix ).

nick_1234 · ‎03-13-2012

pshufb will only work for transposing 4x4 byte matrices.

I've attatched assembly code for x64 that will transpose 16x16 byte data very rapidly

It is based on using punpcklbw and punpckhbw to interleave data between 16x1 colomns stored in xmm registers.

To link it to your c++ project you will need to assemble the file (I use JWasm assembler), link the obj file to your project, then insert these lines to define it as a function.

#ifdef __cplusplus

extern "C" {

#endif

void Transpose16x16A(char* a, char* b);

at);

#ifdef __cplusplus

}

#endif

The basic idea is to interleave data between colomns. I will use a 4x4 matrix as an example

xmm0 xmm1 xmm2 xmm3

a0 b0 c0 d0

a1 b1 c1 d1

a2 b2 c2 d2

a3 b3 c3 d3

The MERGE macro in my code is defined like this:

MERGE MACRO FIRST, SECOND, TEMP

movdqa TEMP, FIRST

punpcklbw FIRST, SECOND

punpckhbw TEMP, SECOND

movdqa SECOND, TEMP

ENDM

It interleaves data from 2 colomns of the matrix. In our example

If we apply

MERGE xmm0, xmm2, xmm4

MERGE xmm1, xmm3, xmm4 we get:

xmm0 xmm1 xmm2 xmm3

a0 b0 a2 b2

c0 d0 c2 d2

a1 b1 a3 b3

c1 d1 c3 d3

The data has been moved across colomns and we are just one step away from the transpose.

Applying

MERGE xmm0, xmm1, xmm4

MERGE xmm2, xmm3, xmm4

we get:

xmm0 xmm1 xmm2 xmm3

a0 a1 a2 a3

b0 b1 b2 b3

c0 c1 c2 c3

d0 d1 d2 d3

the 16x16 transpose operates in the same way but requires more MERGE operations.

One thing to note is my code DOES NOT PERSERVE ANY XMM REGISTERS. The reason for this is that it is slow to save the registers and my code didn't need to. If you need to preserve registers, my suggestion would be to make a function that saves the registers, runs the 16x16 transpose in an assembly loop to generate a bunch of 16x16 transposes, and then restores the registers.

When I timed this algorithm (Intel core i2 processor) on byte data, I achieved speds of on average, 1/4 clock cycle per byte. That means the full 16x16 transpose should take roughly 64 clock cycles. I found this to be about 6x faster than doing the regular scalar algorithm (doublely nested loop with loop unrolling)

SergeyKostrov · ‎03-14-2012

That looks interesting and I'll try to test it (Visual Studio 2005\ ona 32-bit Windows platform ).

A couple of days ago I tested '_MM_TRANSPOSE4_PS' macro vs. 'No-For-Loops' codes ( just exchanges )
for a 4x4 matrix of floats and it outperforms the macro in a couple of times. I'll post results for comparison later.