Fast method to exchange values of two double-precision variables

SergeyKostrov · ‎03-30-2012

Let's say there are two double-precision variables declared as follows:

...
double dValueA = 55.55L;
double dValueB = 77.77L;
...

What is the fastest method in assembler to exchange the values of these two double-precision variables?

Best regards,
Sergey

Patrick_F_Intel1 · ‎03-31-2012

Hello Sergey,
Are you doing x87 math or SSE2 math?
Is this showing up as a bottleneck?
Pat

SergeyKostrov · ‎04-01-2012

Hi Patrick,

Quoting Patrick Fay (Intel)

...
Are you doing x87 math or SSE2 math?

[SergeyK] x87 - Yes (this is because the solution has to be highly portable).
An SSE2 solution could also be considered.

Is this showing up as a bottleneck?

[SergeyK] Yes, and I need to make the exchange as fast as possible.

...

I'd like to provide some technical details. I don't need this to do math; I need to use it in several sorting algorithms, like MergeSort, QuickSort, etc.,
in cases when the 'double' data type is used.

In a generic form it looks like:

[cpp]
...
dTemp = dValueA;
dValueA = dValueB;
dValueB = dTemp;
...
[/cpp]
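For comparison with the hand-written variants in this thread, the generic form can be wrapped in a small inline function. This is only a sketch: `ExchangeGeneric` and `ExchangeGenericDemo` are hypothetical names that do not appear in the original code, and `std::swap` is shown as the standard library equivalent.

```cpp
#include <cassert>
#include <utility>  // std::swap

// Hypothetical helper: the three-assignment exchange written as an
// inline function instead of open-coded statements.
inline void ExchangeGeneric(double& a, double& b)
{
    const double temp = a;
    a = b;
    b = temp;
}

// Small self-check: both forms leave the two values exchanged.
inline bool ExchangeGenericDemo()
{
    double dValueA = 55.55;
    double dValueB = 77.77;

    ExchangeGeneric(dValueA, dValueB);
    assert(dValueA == 77.77 && dValueB == 55.55);

    std::swap(dValueA, dValueB);  // standard library equivalent
    return dValueA == 55.55 && dValueB == 77.77;
}
```

With optimizations enabled, a compiler is free to turn either form into plain register moves, so it may be worth inspecting the generated assembly before replacing it by hand.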

Here is a solution I currently implemented:

[cpp]
#define HrtXchgDATATYPE_RTDOUBLE( dValueA, dValueB ) \
{ \
    _asm FLD [dValueA] \
    _asm FLD [dValueB] \
    _asm FSTP [dValueA] \
    _asm FSTP [dValueB] \
}
[/cpp]

The solution with FLD-FSTP instructions is ~1.6x faster and it improves the performance of the sorting algorithms.
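The `_asm` macro only builds with compilers that accept MSVC-style inline assembler, and FLD on a 64-bit memory operand converts a signaling NaN to a quiet NaN, so the exchange is not strictly bit-exact for every input. A portable alternative, sketched below, moves the raw 64-bit patterns through integer temporaries. `ExchangeBits` and `ExchangeBitsDemo` are hypothetical names, and the thread contains no measurement of this variant against FLD-FSTP.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper: exchanges two doubles by copying their 64-bit
// patterns through integer temporaries, so no floating-point
// load/store conversion is involved.
inline void ExchangeBits(double& a, double& b)
{
    static_assert(sizeof(double) == sizeof(std::uint64_t),
                  "this sketch assumes a 64-bit double");

    std::uint64_t bitsA;
    std::uint64_t bitsB;
    std::memcpy(&bitsA, &a, sizeof bitsA);
    std::memcpy(&bitsB, &b, sizeof bitsB);
    std::memcpy(&a, &bitsB, sizeof bitsB);
    std::memcpy(&b, &bitsA, sizeof bitsA);
}

// Small self-check using the values from the first post.
inline bool ExchangeBitsDemo()
{
    double dValueA = 55.55;
    double dValueB = 77.77;
    ExchangeBits(dValueA, dValueB);
    assert(dValueA == 77.77);
    return dValueB == 55.55;
}
```

On a 32-bit x87 build each 64-bit copy is likely to become two 32-bit moves, so this may or may not be faster than the macro; it would need to be timed on the target.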

Is it possible to make the exchange faster?

Best regards,
Sergey

TimP · ‎04-01-2012

What is keeping the compiler from using portable source code to accomplish the same thing? Are you trying to include the cases of misaligned data? If not, wouldn't 128-bit parallel moves be preferable?

SergeyKostrov · ‎04-01-2012

Quoting TimP (Intel)

What is keeping the compiler from using portable source code to accomplish the same thing?

[SergeyK] Nothing, but my top priority is optimization of the source code in the first place.
It means that the code must be highly optimized at the C/C++ level, sometimes with inline assembler, and
I can't rely all the time on the optimizations of a C/C++ compiler.

Are you trying to include the cases of misaligned data? If not, wouldn't 128-bit parallel moves be preferable?

[SergeyK] Could you provide more technical details with an example?

Thanks in advance.
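One possible reading of the 128-bit suggestion, as a sketch only and assuming an x86 target with SSE2: individual doubles can be exchanged through XMM registers with 64-bit moves, and when elements are handled in 16-byte aligned pairs, a whole pair can be exchanged with 128-bit moves. The function names below are hypothetical; the intrinsics come from `<emmintrin.h>`.

```cpp
#include <cassert>
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

// Exchange two individual doubles through XMM registers (64-bit moves).
// No alignment beyond that of a normal double is required.
inline void ExchangeSse2(double* a, double* b)
{
    const __m128d va = _mm_load_sd(a);
    const __m128d vb = _mm_load_sd(b);
    _mm_store_sd(a, vb);
    _mm_store_sd(b, va);
}

// Exchange two pairs of doubles with 128-bit moves.
// Assumption: both pointers are 16-byte aligned, which is what
// _mm_load_pd and _mm_store_pd require.
inline void ExchangePairsSse2(double* a, double* b)
{
    assert(reinterpret_cast<std::uintptr_t>(a) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(b) % 16 == 0);

    const __m128d va = _mm_load_pd(a);
    const __m128d vb = _mm_load_pd(b);
    _mm_store_pd(a, vb);
    _mm_store_pd(b, va);
}

// Small self-check covering both helpers.
inline bool ExchangeSse2Demo()
{
    double dValueA = 55.55;
    double dValueB = 77.77;
    ExchangeSse2(&dValueA, &dValueB);

    alignas(16) double pairA[2] = { 1.0, 2.0 };
    alignas(16) double pairB[2] = { 3.0, 4.0 };
    ExchangePairsSse2(pairA, pairB);

    return dValueA == 77.77 && dValueB == 55.55
        && pairA[0] == 3.0 && pairA[1] == 4.0
        && pairB[0] == 1.0 && pairB[1] == 2.0;
}
```

For misaligned data the `_mm_loadu_pd` and `_mm_storeu_pd` variants would be needed instead, which is presumably why the question about alignment was raised.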

SergeyKostrov · ‎04-05-2012

Quoting Sergey Kostrov

...
The solution with FLD-FSTP instructions is ~1.6x faster and it improves the performance of the sorting algorithms.

Is it possible to make the exchange faster?

I've done a set of tests with 'Load-Shuffle-Store' intrinsic functions, like

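[cpp]
...
RTdouble ddA[2] = { 77.0L, 55.0L };
__m128d ddV = { 0.0L, 0.0L };

ddV = _mm_loadu_pd( &ddA[0] );           // unaligned load of both values
ddV = _mm_shuffle_pd( ddV, ddV, 1 );     // swap the low and high halves
_mm_store_pd( &ddA[0], ddV );            // aligned store: ddA must be 16-byte aligned
...
[/cpp]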

but it is not as fast as the 'Fld-Fstp' based exchange. The final relative results of my tests are as follows:

Generic based Exchange - ~1.5x slower than Fld-Fstp
Fld-Fstp based Exchange - 1.0x
Shuffle based Exchange - ~2.5x slower than Fld-Fstp
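Relative numbers like these depend heavily on the compiler, the optimization flags and the CPU, so they are worth re-measuring on each target. The harness below is a minimal sketch of how such a comparison might be set up; it is not the test code behind the figures above, and `TimeExchanges` and `TimeExchangesDemo` are hypothetical names.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <utility>  // std::swap
#include <vector>

// Hypothetical harness: applies `exchange` to each neighbouring pair of
// `data` for `passes` passes and returns the elapsed time in nanoseconds.
template <typename Exchange>
long long TimeExchanges(std::vector<double>& data, int passes, Exchange exchange)
{
    const auto start = std::chrono::steady_clock::now();
    for (int p = 0; p < passes; ++p)
        for (std::size_t i = 0; i + 1 < data.size(); i += 2)
            exchange(data[i], data[i + 1]);
    const auto stop = std::chrono::steady_clock::now();

    return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
}

// Runs an odd number of passes so the result differs from the input,
// then reports whether every pair ended up exchanged.
inline bool TimeExchangesDemo()
{
    std::vector<double> data = { 55.55, 77.77, 1.0, 2.0 };

    const long long ns = TimeExchanges(
        data, 3, [](double& a, double& b) { std::swap(a, b); });
    assert(ns >= 0);

    return data[0] == 77.77 && data[1] == 55.55
        && data[2] == 2.0 && data[3] == 1.0;
}
```

Each exchange variant would be passed in as the `exchange` argument and run over a data set large enough, and with enough passes, for the timer resolution not to dominate. Checking the final contents, as the demo does, also keeps the compiler from discarding the loop as dead code.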