Community
cancel
Showing results for 
Search instead for 
Did you mean: 
Highlighted
38 Views

Matrix Transpose for char/short array

We can find the macro __MM_TRANSPOSE_PS for transpose of floats. But I am interested in doing transpose of an array of characters. I was able to write _MM_TRANSPOSE_PS myself using unpack and move intrinsics, but can't find similar intrinsics for chars.

Can anyone please help as to what approach should be taken in this situation.

Thanks
HG
0 Kudos
15 Replies
Highlighted
Beginner
38 Views

Depending on the array size PSHUFB can be a solution or at least be helpful doing it.
Please give an example.
0 Kudos
Highlighted
38 Views

Show us your declaration for the character array.
This will give us an idea of the size of the array and the number of dimensions.

Jim Dempsey

0 Kudos
Highlighted
Valued Contributor II
38 Views

An '_MM_TRANSPOSE4_PS' macro declared in 'xmmintrin.h' is designed to use a 4x4 matrix of floats
( single-precision ) as input.

In case of a similar approach for a 'char' type the biggest dimension will be 16x16, and for 'short' type it will be 8x8.

I'd like to repeatthe same question:

How big are your 'char' / 'short' matricies?

Best regards,
Sergey
0 Kudos
Highlighted
Valued Contributor II
38 Views

We can find the macro __MM_TRANSPOSE_PS for transpose of floats...

Does it make sense to use a SSE based transpose for a 4x4 matrix instead of a Classic algorithm?

Please take a look atresults of a test:

DEBUG configuration

> Test1028 Start <
Sub-Test 5 - 200,000,000 calls to [ CLASSIC 4x4 Matrix Transpose ] - 19657 ticks
Sub-Test 6 - 200,000,000 calls to [ SSE 4x4 Matrix Transpose ] - 8640 ticks // 2.28x faster
> Test1028 End <

RELEASE configuration

> Test1028 Start <
Sub-Test 5 - 200,000,000 calls to [ CLASSIC 4x4 Matrix Transpose ] - 18563 ticks
Sub-Test 6 - 200,000,000 calls to [ SSE 4x4 Matrix Transpose ] - 5843 ticks // 3.18x faster
> Test1028 End <

0 Kudos
Highlighted
38 Views

I am working on some benchmarks and generally taking sizes like 1k x 1k. shuffling the xmm registers seem the only posssible way which i dont think will give some good gains.
0 Kudos
Highlighted
Beginner
38 Views

0 Kudos
Highlighted
Valued Contributor II
38 Views

I am working on some benchmarks and generally taking sizes like 1k x 1k. shuffling the xmm registers seem the only posssible way which i dont think will give some good gains.


I couldprovide you with the performance numbers for two Matrix Transpose algorithms, applied to a
1K x 1K matrix,that I've implemented for my current project. That is,

- a Classic ( Two-For-Loops /Non-Inplace)

and

- a Diagonal Based( Two-For-Loops / Inplace )

The Diagonal Based algorithm doesn't need a second outputmatrix andhas areduced number of
exchanges. It never "touches" values along the diagonal line from left-top corner to right-bottom corner of the matrix.

0 Kudos
Highlighted
Valued Contributor II
38 Views

Please take a look at performance results.

Matrix size: 1,024 x 1,024

Classic Transpose - ( 128 transposes in 10.015 sec ) = 0.0782421875 sec
Diagonal Transpose - (128 transposes in 5.609 sec) = 0.0438203125 sec => ~1.79x faster

0 Kudos
Highlighted
Valued Contributor II
38 Views

Please take a look at results of another test.

If four __m128 variables:

...
__m128 row1 = { 0x0 };
__m128 row2 = { 0x0 };
__m128 row3 = { 0x0 };
__m128 row4 = { 0x0 };
...

initialized with characters as follows:

...
row1.m128_u8[ 0] = '0'; r1.m128_u8[ 1] = '1'; r1.m128_u8[ 2] = '2'; r1.m128_u8[ 3] = '3';
row1.m128_u8[ 4] = '4'; r1.m128_u8[ 5] = '5'; r1.m128_u8[ 6] = '6'; r1.m128_u8[ 7] = '7';
row1.m128_u8[ 8] = '8'; r1.m128_u8[ 9] = '9'; r1.m128_u8[10] = 'A'; r1.m128_u8[11] = 'B';
row1.m128_u8[12] = 'C'; r1.m128_u8[13] = 'D'; r1.m128_u8[14] = 'E'; r1.m128_u8[15] = 'F';
...
< the same for rows row2, row3 and row4 >
...

a Source Matrix ( as characters ) will look like:

0123456789ABCDEF
0123456789ABCDEF
0123456789ABCDEF
0123456789ABCDEF

and after a call to:

...
_MM_TRANSPOSE4_PS( row1, row2, row3, row4 );
...

a Transposed Matrix will look like:

0123012301230123
4567456745674567
89AB89AB89AB89AB
CDEFCDEFCDEFCDEF

This is wrong and there is nothing unusual here. The _MM_TRANSPOSE4_PS macro cannot be used for
transposing a 4x16 matrix of characters because it was designed to transpose a 4x4 matrix of floats.

0 Kudos
Highlighted
Valued Contributor II
38 Views

I am working on some benchmarks and generally taking sizes like 1k x 1k. shuffling the xmm registers seem the only posssible way which i dont think will give some good gains.


It would be interesting to see results of your R&D. Please provide some technical details and performance
numbersif you can.

Did you consideran Eklundh method of aMatrix Transpose?

Best regards,
Sergey

0 Kudos
Highlighted
38 Views

Thanks for your interest. i finally managed to do a good transpose using unpackepi8/16/32/64 instructions. its hard to give any numbers as transpose was a part of the actial problem. anyways i am intetested in eklundh method. what kind of numbets are reachable there?
0 Kudos
Highlighted
Valued Contributor II
38 Views

Thanks for your interest. i finally managed to do a good transpose using unpackepi8/16/32/64 instructions. its hard to give any numbers as transpose was a part of the actial problem...

It would nice to see a performance comparison of your SSE based algorithmwith a Classic algorithm.

The Eklundh method for a matrix transpose makes moreiterationsandmoreexchangescompared to a
Diagonal based algorithm.
0 Kudos
Highlighted
Valued Contributor II
38 Views

...anyways i am intetested in eklundh method. what kind of numbets are reachable there?...


Here is a comparisonof number of exchangesfordifferent algorithms. In case of an 8x8 matrix:

Classic-64 exchanges
Diagonal -28 exchanges
Eklundh - 48 exchanges

Take into account that forDiagonal and Eklundh algorithms an input matrix must be Square and both
algorithms areInplace (don't need an output matrix ).

0 Kudos
Highlighted
Beginner
38 Views

pshufb will only work for transposing 4x4 byte matrices.
I've attatched assembly code for x64 that will transpose 16x16 byte data very rapidly
It is based on using punpcklbw and punpckhbw to interleave data between 16x1 colomns stored in xmm registers.
To link it to your c++ project you will need to assemble the file (I use JWasm assembler), link the obj file to your project, then insert these lines to define it as a function.
#ifdef __cplusplus
extern "C" {
#endif
void Transpose16x16A(char* a, char* b);
at);
#ifdef __cplusplus
}
#endif
The basic idea is to interleave data between colomns. I will use a 4x4 matrix as an example
xmm0 xmm1 xmm2 xmm3
a0 b0 c0 d0
a1 b1 c1 d1
a2 b2 c2 d2
a3 b3 c3 d3
The MERGE macro in my code is defined like this:
MERGE MACRO FIRST, SECOND, TEMP
movdqa TEMP, FIRST
punpcklbw FIRST, SECOND
punpckhbw TEMP, SECOND
movdqa SECOND, TEMP
ENDM
It interleaves data from 2 colomns of the matrix. In our example
If we apply
MERGE xmm0, xmm2, xmm4
MERGE xmm1, xmm3, xmm4 we get:
xmm0 xmm1 xmm2 xmm3
a0 b0 a2 b2
c0 d0 c2 d2
a1 b1 a3 b3
c1 d1 c3 d3
The data has been moved across colomns and we are just one step away from the transpose.
Applying
MERGE xmm0, xmm1, xmm4
MERGE xmm2, xmm3, xmm4
we get:
xmm0 xmm1 xmm2 xmm3
a0 a1 a2 a3
b0 b1 b2 b3
c0 c1 c2 c3
d0 d1 d2 d3
the 16x16 transpose operates in the same way but requires more MERGE operations.
One thing to note is my code DOES NOT PERSERVE ANY XMM REGISTERS. The reason for this is that it is slow to save the registers and my code didn't need to. If you need to preserve registers, my suggestion would be to make a function that saves the registers, runs the 16x16 transpose in an assembly loop to generate a bunch of 16x16 transposes, and then restores the registers.
When I timed this algorithm (Intel core i2 processor) on byte data, I achieved speds of on average, 1/4 clock cycle per byte. That means the full 16x16 transpose should take roughly 64 clock cycles. I found this to be about 6x faster than doing the regular scalar algorithm (doublely nested loop with loop unrolling)
0 Kudos
Highlighted
Valued Contributor II
38 Views

That looks interesting and I'll try to test it (Visual Studio 2005\ ona 32-bit Windows platform ).

A couple of days ago I tested '_MM_TRANSPOSE4_PS' macro vs. 'No-For-Loops' codes ( just exchanges )
for a 4x4 matrix of floats and it outperforms the macro in a couple of times. I'll post results for comparison later.
0 Kudos