
Can anyone please advise on what approach should be taken in this situation?

Thanks

HG


Please give an example.


This will give us an idea of the size of the array and the number of dimensions.

Jim Dempsey


The '**_MM_TRANSPOSE4_PS**' macro declared in '**xmmintrin.h**' is designed to use a **4x4** matrix of **floats** ( single-precision ) as input. With a similar approach, the biggest dimension for a '**char**' type will be **16x16**, and for a '**short**' type it will be **8x8**.
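For reference, here is a minimal sketch of how the macro is typically applied to a 4x4 row-major matrix of floats. The helper name `transpose4x4_ps` is hypothetical (it is not from this thread), and unaligned loads and stores are used so the sketch makes no assumption about the alignment of the array.

```c
#include <xmmintrin.h>  /* SSE: __m128, _mm_loadu_ps, _mm_storeu_ps, _MM_TRANSPOSE4_PS */

/* Transpose a 4x4 row-major float matrix in place.
   transpose4x4_ps is a hypothetical helper name used for illustration. */
static void transpose4x4_ps(float m[16])
{
    /* Load the four rows; each __m128 holds four 32-bit floats. */
    __m128 row0 = _mm_loadu_ps(m + 0);
    __m128 row1 = _mm_loadu_ps(m + 4);
    __m128 row2 = _mm_loadu_ps(m + 8);
    __m128 row3 = _mm_loadu_ps(m + 12);

    /* The macro rewrites its four arguments with the transposed rows. */
    _MM_TRANSPOSE4_PS(row0, row1, row2, row3);

    /* Store the transposed rows back over the original matrix. */
    _mm_storeu_ps(m + 0, row0);
    _mm_storeu_ps(m + 4, row1);
    _mm_storeu_ps(m + 8, row2);
    _mm_storeu_ps(m + 12, row3);
}
```

The macro works on 32-bit lanes, which is why it fits floats and does not carry over directly to `char` or `short` elements.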

I'd like to repeat the same question:

**How big are your 'char' / 'short' matrices?**

Best regards,

Sergey


*We can find the macro **__MM_TRANSPOSE_PS** for transpose of floats...*

Does it make sense to use an **SSE** based transpose for a **4x4** matrix instead of a **Classic** algorithm?

Please take a look at the results of a test:

**DEBUG configuration**

> Test1028 Start <

Sub-Test 5 - 200,000,000 calls to [ **CLASSIC** 4x4 Matrix Transpose ] - **19657** ticks

Sub-Test 6 - 200,000,000 calls to [ **SSE** 4x4 Matrix Transpose ] - **8640** ticks // **2.28**x faster

> Test1028 End <

**RELEASE configuration**

> Test1028 Start <

Sub-Test 5 - 200,000,000 calls to [ **CLASSIC** 4x4 Matrix Transpose ] - **18563** ticks

Sub-Test 6 - 200,000,000 calls to [ **SSE** 4x4 Matrix Transpose ] - **5843** ticks // **3.18**x faster

> Test1028 End <


*I am working on some benchmarks and generally taking sizes like **1k x 1k**. Shuffling the xmm registers seems the only possible way, which I don't think will give good gains.*

I could provide you with the performance numbers for two **Matrix Transpose** algorithms, applied to a **1K x 1K** matrix, that I've implemented for my current project. That is,

- a **Classic** ( Two-For-Loops / Non-Inplace )

and

- a **Diagonal Based** ( Two-For-Loops / Inplace )

The **Diagonal Based** algorithm doesn't need a second output matrix and has a reduced number of exchanges. It never "touches" values along the diagonal line from the top-left corner to the bottom-right corner of the matrix.
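The two approaches can be sketched as follows. This is only an illustration of the descriptions above, not the implementation that was benchmarked; the function names and the row-major `float` layout are assumptions.

```c
#include <stddef.h>

/* Classic: two nested loops, out-of-place. Works for any rows x cols matrix.
   src is rows x cols, dst is cols x rows, both row-major. */
static void transpose_classic(const float *src, float *dst,
                              size_t rows, size_t cols)
{
    for (size_t r = 0; r < rows; ++r)
        for (size_t c = 0; c < cols; ++c)
            dst[c * rows + r] = src[r * cols + c];
}

/* Diagonal based: two nested loops, in-place, square n x n matrices only.
   Only elements above the main diagonal are visited; each one is exchanged
   with its mirror below the diagonal, so diagonal values are never touched. */
static void transpose_diagonal(float *m, size_t n)
{
    for (size_t r = 0; r + 1 < n; ++r) {
        for (size_t c = r + 1; c < n; ++c) {
            float tmp = m[r * n + c];
            m[r * n + c] = m[c * n + r];
            m[c * n + r] = tmp;
        }
    }
}
```

The in-place version performs n(n-1)/2 exchanges, compared with n*n element copies for the out-of-place version.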


Please take a look at performance results.

Matrix size: **1,024 x 1,024**

**Classic** Transpose - ( 128 transposes in 10.015 sec ) = 0.0782421875 sec

**Diagonal** Transpose - ( 128 transposes in 5.609 sec ) = 0.0438203125 sec => ~**1.79x** faster


Please take a look at results of another test.

If four **__m128** variables:

...

__m128 row1 = { 0x0 };

__m128 row2 = { 0x0 };

__m128 row3 = { 0x0 };

__m128 row4 = { 0x0 };

...

are initialized with characters as follows:

...

row1.m128_u8[ 0] = '0'; row1.m128_u8[ 1] = '1'; row1.m128_u8[ 2] = '2'; row1.m128_u8[ 3] = '3';

row1.m128_u8[ 4] = '4'; row1.m128_u8[ 5] = '5'; row1.m128_u8[ 6] = '6'; row1.m128_u8[ 7] = '7';

row1.m128_u8[ 8] = '8'; row1.m128_u8[ 9] = '9'; row1.m128_u8[10] = 'A'; row1.m128_u8[11] = 'B';

row1.m128_u8[12] = 'C'; row1.m128_u8[13] = 'D'; row1.m128_u8[14] = 'E'; row1.m128_u8[15] = 'F';

...

< the same for rows row2, row3 and row4 >

...

a **Source Matrix** ( as characters ) will look like:

0123456789ABCDEF

0123456789ABCDEF

0123456789ABCDEF

0123456789ABCDEF

and after a call to:

...

**_MM_TRANSPOSE4_PS**( row1, row2, row3, row4 );

...

a **Transposed Matrix** will look like:

0123012301230123

4567456745674567

89AB89AB89AB89AB

CDEFCDEFCDEFCDEF

This result is wrong, but there is nothing unusual here. The **_MM_TRANSPOSE4_PS** macro cannot be used for transposing a **4x16** matrix of characters because it was designed to transpose a **4x4** matrix of floats: it moves each group of four characters as a single 32-bit element.
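For narrower element types, a transpose has to be built from integer instructions instead. The following is a minimal sketch, assuming SSE2 is available, of an **8x8** transpose of 16-bit values using the `_mm_unpacklo/hi_epi16`, `_epi32` and `_epi64` intrinsics. The function name `transpose8x8_epi16` is hypothetical and the code is not taken from anyone's implementation in this thread.

```c
#include <emmintrin.h>  /* SSE2: __m128i and the integer unpack intrinsics */
#include <stdint.h>

/* Transpose an 8x8 row-major matrix of 16-bit integers (src -> dst).
   transpose8x8_epi16 is a hypothetical name used for illustration. */
static void transpose8x8_epi16(const int16_t src[64], int16_t dst[64])
{
    __m128i r[8], t[8], u[8];
    int i, h;

    for (i = 0; i < 8; ++i)
        r[i] = _mm_loadu_si128((const __m128i *)(src + 8 * i));

    /* Stage 1: interleave 16-bit elements of row pairs (0,1), (2,3), (4,5), (6,7). */
    for (i = 0; i < 4; ++i) {
        t[2 * i]     = _mm_unpacklo_epi16(r[2 * i], r[2 * i + 1]);
        t[2 * i + 1] = _mm_unpackhi_epi16(r[2 * i], r[2 * i + 1]);
    }

    /* Stage 2: interleave 32-bit pairs, separately for rows 0-3 (h = 0)
       and rows 4-7 (h = 4). */
    for (h = 0; h < 8; h += 4) {
        u[h + 0] = _mm_unpacklo_epi32(t[h + 0], t[h + 2]);
        u[h + 1] = _mm_unpackhi_epi32(t[h + 0], t[h + 2]);
        u[h + 2] = _mm_unpacklo_epi32(t[h + 1], t[h + 3]);
        u[h + 3] = _mm_unpackhi_epi32(t[h + 1], t[h + 3]);
    }

    /* Stage 3: combine 64-bit halves from the top and bottom groups;
       each result is one full column of the source, i.e. one output row. */
    for (i = 0; i < 4; ++i) {
        _mm_storeu_si128((__m128i *)(dst + 8 * (2 * i)),
                         _mm_unpacklo_epi64(u[i], u[i + 4]));
        _mm_storeu_si128((__m128i *)(dst + 8 * (2 * i + 1)),
                         _mm_unpackhi_epi64(u[i], u[i + 4]));
    }
}
```

A 16x16 transpose of `char` values follows the same pattern with one extra stage that starts from the `_epi8` unpacks.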


*I am working on some benchmarks and generally taking sizes like 1k x 1k.*

**Shuffling the xmm registers seems the only possible way, which I don't think will give good gains**.

It would be interesting to see results of your R&D. Please provide some technical details and performance numbers if you can.

Did you consider the **Eklundh** method of Matrix Transpose?

Best regards,

Sergey


*Thanks for your interest. I finally managed to do a good transpose using unpackepi8/16/32/64 instructions. It's hard to give any numbers as the transpose was a part of the actual problem...*

It would be nice to see a performance comparison of your **SSE** based algorithm with a **Classic** algorithm.

The **Eklundh** method for a matrix transpose makes more iterations and more exchanges compared to a **Diagonal** based algorithm.


*...anyway, I am interested in the Eklundh method. What kind of numbers are reachable there?...*

Here is a comparison of the number of exchanges for different algorithms. In case of an **8x8** matrix:

**Classic** - 64 exchanges

**Diagonal** - 28 exchanges

**Eklundh** - 48 exchanges

Take into account that for the **Diagonal** and **Eklundh** algorithms an input matrix must be square, and both algorithms are in-place ( don't need an output matrix ).
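The counts for the two in-place algorithms can be reproduced with a small sketch. It assumes one common formulation of Eklundh's method (log2(N) passes, each exchanging N*N/4 pairs, with N a power of two), which may differ in detail from the implementation measured here; the function names are hypothetical.

```c
#include <stddef.h>

/* Exchange two ints. */
static void swap_int(int *a, int *b)
{
    int tmp = *a;
    *a = *b;
    *b = tmp;
}

/* Diagonal based in-place transpose of an n x n matrix.
   Returns the number of exchanges: n * (n - 1) / 2. */
static size_t transpose_diagonal_count(int *m, size_t n)
{
    size_t exchanges = 0;
    for (size_t r = 0; r + 1 < n; ++r)
        for (size_t c = r + 1; c < n; ++c) {
            swap_int(&m[r * n + c], &m[c * n + r]);
            ++exchanges;
        }
    return exchanges;
}

/* Eklundh-style in-place transpose; n is assumed to be a power of two.
   Pass s (s = 1, 2, 4, ...) exchanges element (i, j) with (i + s, j - s)
   for every i whose bit s is clear and every j whose bit s is set, which
   swaps that bit between the row and column index of each element.
   Returns the number of exchanges: (n * n / 4) * log2(n). */
static size_t transpose_eklundh_count(int *m, size_t n)
{
    size_t exchanges = 0;
    for (size_t s = 1; s < n; s <<= 1)
        for (size_t i = 0; i < n; ++i) {
            if (i & s)
                continue;
            for (size_t j = 0; j < n; ++j) {
                if (!(j & s))
                    continue;
                swap_int(&m[i * n + j], &m[(i + s) * n + (j - s)]);
                ++exchanges;
            }
        }
    return exchanges;
}
```

For an 8x8 matrix the first function performs 8 * 7 / 2 = 28 exchanges and the second 16 * 3 = 48, matching the figures above.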


**Visual Studio 2005** on a **32-bit** Windows platform ).

A couple of days ago I tested the '**_MM_TRANSPOSE4_PS**' macro vs. '**No-For-Loops**' code ( just exchanges ) for a **4x4** matrix of floats, and the '**No-For-Loops**' code outperformed the macro by a couple of times. I'll post results for comparison later.
