
...

double dValueA = 55.55L;

double dValueB = 77.77L;

...

What is the **fastest** method in assembler to exchange the values of these two double-precision variables?

Best regards,

Sergey


Are you doing x87 math or SSE2 math?

Is this showing up as a bottleneck?

Pat


*...*

*Are you doing x87 math or SSE2 math?*

[**SergeyK**] **x87** - Yes (this is because the solution has to be highly portable). An **SSE2** solution could also be considered.

*Is this showing up as a bottleneck?*

[**SergeyK**] Yes, and I need to make the exchange as fast as possible.

*...*

I'd like to provide some technical details. I don't need this to do the math, but I need to use it in several sorting algorithms, like **MergeSort**, **QuickSort**, etc., in cases when '**double**' data types are used.

In a generic form it looks like:
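The listing itself is not preserved in this thread. As a minimal sketch only, a generic exchange through a temporary usually looks like the following; the function name `exchange_generic` is hypothetical and is not the poster's original code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not the original listing):
   exchange two doubles through a temporary variable. */
static void exchange_generic(double *a, double *b)
{
    assert(a != NULL && b != NULL);
    double tmp = *a;  /* save the first value */
    *a = *b;          /* overwrite the first with the second */
    *b = tmp;         /* store the saved value into the second */
}
```

Calling `exchange_generic(&dValueA, &dValueB)` on the two variables from the question leaves `dValueA` holding the old `dValueB` and vice versa. An optimizing compiler is free to turn this into plain register moves.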

Here is a solution I currently implemented:

The solution with **FLD-FSTP** instructions is ~1.6x faster, and it improves the performance of the sorting algorithms.
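An FLD-FSTP exchange of this kind might be sketched as below. This is an illustration, not the poster's code: it assumes GCC or Clang extended inline assembly (AT&T syntax) on an x86 target, the name `exchange_fld_fstp` is hypothetical, and on any other compiler or architecture it falls back to a plain temporary.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: exchange two doubles through the x87 register stack.
   Assumption: GCC/Clang extended asm on x86; otherwise a portable fallback. */
static void exchange_fld_fstp(double *a, double *b)
{
    assert(a != NULL && b != NULL);
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__i386__) || defined(__x86_64__))
    __asm__ volatile(
        "fldl  %0\n\t"   /* push *a        -> ST(0) = a            */
        "fldl  %1\n\t"   /* push *b        -> ST(0) = b, ST(1) = a */
        "fstpl %0\n\t"   /* pop ST(0) into *a (a now holds old b)  */
        "fstpl %1\n\t"   /* pop ST(0) into *b (b now holds old a)  */
        : "+m"(*a), "+m"(*b)
        :
        : "st", "st(1)");
#else
    double tmp = *a;     /* portable fallback: swap through a temporary */
    *a = *b;
    *b = tmp;
#endif
}
```

The two loads push both values and the two stores pop them in reverse order, so the x87 stack is left balanced. One caveat worth checking before using this in a sort: loading a double with FLD converts it to the 80-bit format and quiets a signaling NaN, so the exchange is not bit-exact for such values.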

Is it possible to make the exchange faster?

Best regards,

Sergey


*What is keeping the compiler from using portable source code to accomplish the same thing?*

[**SergeyK**] Nothing, but my top priority is optimization of the source code in the first place. It means that the code must be highly optimized at the C/C++ level, sometimes with inline assembler, and I can't always rely on the optimizations of a C/C++ compiler.

*Are you trying to include the cases of misaligned data? If not, wouldn't 128-bit parallel moves be preferable?*

[**SergeyK**] Could you provide more technical details with an example?

Thanks in advance.
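One possible reading of the "128-bit parallel moves" suggestion is to move two doubles per instruction with SSE2. The sketch below rests on assumptions: the name `exchange_pair_128` is hypothetical, the two pairs must not overlap, and unaligned loads and stores are used so that misaligned data is still handled. Without SSE2 it falls back to a scalar loop.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics */
#endif

/* Hypothetical helper: exchange p[0..1] with q[0..1] using 128-bit moves.
   Assumption: the two pairs do not overlap. */
static void exchange_pair_128(double *p, double *q)
{
    assert(p != NULL && q != NULL);
#if defined(__SSE2__)
    __m128d vp = _mm_loadu_pd(p);   /* load two doubles from p */
    __m128d vq = _mm_loadu_pd(q);   /* load two doubles from q */
    _mm_storeu_pd(p, vq);           /* write q's pair to p */
    _mm_storeu_pd(q, vp);           /* write p's pair to q */
#else
    for (int i = 0; i < 2; ++i) {   /* scalar fallback */
        double tmp = p[i];
        p[i] = q[i];
        q[i] = tmp;
    }
#endif
}
```

If the data is known to be 16-byte aligned, `_mm_load_pd` and `_mm_store_pd` could be used instead of the unaligned variants.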


*...*

*The solution with FLD-FSTP instructions is ~1.6x faster, and it improves the performance of the sorting algorithms. Is it possible to make the exchange faster?*

I've done a set of tests with '**Load-Shuffle-Store**' intrinsic functions, like

but it is not as fast as the '**Fld-Fstp**' based exchange. The final relative results of my tests are as follows:

Generic based Exchange - ~**1.5x** slower than Fld-Fstp

Fld-Fstp based Exchange - **1.0x**

Shuffle based Exchange - ~**2.5x** slower than Fld-Fstp
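The intrinsic listing did not survive in this post. As a sketch of what a load-shuffle-store exchange can look like with SSE2, assuming the two doubles sit next to each other in memory, see below. The name `exchange_adjacent_shuffle` is hypothetical and this is not necessarily the sequence that was measured above.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics */
#endif

/* Hypothetical helper: exchange pair[0] and pair[1] with load-shuffle-store.
   Assumption: both doubles are adjacent in memory. */
static void exchange_adjacent_shuffle(double *pair)
{
    assert(pair != NULL);
#if defined(__SSE2__)
    __m128d v = _mm_loadu_pd(pair);   /* v = (pair[0], pair[1]) */
    v = _mm_shuffle_pd(v, v, 1);      /* low <- old high, high <- old low */
    _mm_storeu_pd(pair, v);           /* write the swapped pair back */
#else
    double tmp = pair[0];             /* scalar fallback */
    pair[0] = pair[1];
    pair[1] = tmp;
#endif
}
```

The shuffle only helps when both values can be loaded as one 128-bit vector. For two unrelated variables, each value needs its own load and store, and the shuffle is extra work on top of that.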
