Fast method to exchange values of two double-precision variables

SergeyKostrov
Valued Contributor II
Let's say there are two double-precision variables declared as follows:

...
double dValueA = 55.55L;
double dValueB = 77.77L;
...

What is the fastest method in assembler to exchange the values of these two double-precision variables?

Best regards,
Sergey
Patrick_F_Intel1
Employee
Hello Sergey,
Are you doing x87 math or SSE2 math?
Is this showing up as a bottleneck?
Pat
SergeyKostrov
Valued Contributor II
Hi Patrick,

Quoting Patrick Fay (Intel)
...
Are you doing x87 math or SSE2 math?

[SergeyK] x87, yes (this is because the solution has to be highly portable).
An SSE2 solution could also be considered.

Is this showing up as a bottleneck?

[SergeyK] Yes, and I need to make the exchange as fast as possible.

...


I'd like to provide some technical details. I don't need this for math; I need it in several sorting algorithms, like MergeSort, QuickSort, etc.,
in cases where 'double' data types are used.

In a generic form it looks like:

[cpp]
...
dTemp = dValueA;
dValueA = dValueB;
dValueB = dTemp;
...
[/cpp]
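
For reference, here is a minimal portable sketch of the same generic exchange written as a function instead of inline statements. The name SwapDoubleBits is hypothetical (it is not part of my library); it copies the raw 64-bit patterns through integer temporaries with memcpy, assuming 'double' is 64 bits wide. Whether a given compiler turns this into better code than the three assignments above has to be measured, I make no claim about that here.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: exchanges two doubles by copying their raw
// 64-bit representations through integer temporaries.
// Assumption: sizeof(double) == sizeof(std::uint64_t).
inline void SwapDoubleBits(double &dValueA, double &dValueB)
{
    static_assert(sizeof(double) == sizeof(std::uint64_t),
                  "this sketch assumes a 64-bit double");

    std::uint64_t uiA;
    std::uint64_t uiB;

    std::memcpy(&uiA, &dValueA, sizeof uiA);  // read the bits of A
    std::memcpy(&uiB, &dValueB, sizeof uiB);  // read the bits of B
    std::memcpy(&dValueA, &uiB, sizeof uiB);  // A receives the bits of B
    std::memcpy(&dValueB, &uiA, sizeof uiA);  // B receives the bits of A
}
```

Usage is simply SwapDoubleBits( dValueA, dValueB ) with the two variables declared in the first post.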


Here is a solution I currently implemented:

[cpp]
#define HrtXchgDATATYPE_RTDOUBLE( dValueA, dValueB ) \
{                                                    \
    _asm FLD  [dValueA]                              \
    _asm FLD  [dValueB]                              \
    _asm FSTP [dValueA]                              \
    _asm FSTP [dValueB]                              \
}
[/cpp]


The solution with FLD-FSTP instructions is ~1.6x faster and it improves the performance of the sorting algorithms.

Is it possible to make the exchange faster?

Best regards,
Sergey

TimP
Honored Contributor III
What is keeping the compiler from using portable source code to accomplish the same thing? Are you trying to include the cases of misaligned data? If not, wouldn't 128-bit parallel moves be preferable?
SergeyKostrov
Valued Contributor II
Quoting TimP (Intel)
What is keeping the compiler from using portable source code to accomplish the same thing?

[SergeyK] Nothing, but my top priority is optimization of the source code in the first place.
It means that the code must be highly optimized at the C/C++ level, sometimes with inline assembler, and
I can't rely all the time on the optimizations of a C/C++ compiler.

Are you trying to include the cases of misaligned data? If not, wouldn't 128-bit parallel moves be preferable?

[SergeyK] Could you provide more technical details with an example?


Thanks in advance.

SergeyKostrov
Valued Contributor II
...
The solution with FLD-FSTP instructions is ~1.6x faster and it improves the performance of the sorting algorithms.

Is it possible to make the exchange faster?


I've done a set of tests with 'Load-Shuffle-Store' intrinsic functions, like

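[cpp]
...
RTdouble ddA[2] = { 77.0L, 55.0L };
__m128d ddV = { 0.0L, 0.0L };

ddV = _mm_loadu_pd( &ddA[0] );        // unaligned load of both values
ddV = _mm_shuffle_pd( ddV, ddV, 1 );  // swap the low and high halves
_mm_store_pd( &ddA[0], ddV );         // aligned store: ddA must be 16-byte aligned
...
[/cpp]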


but it is not as fast as the 'Fld-Fstp' based exchange. The final relative results of my tests are as follows:

Generic based Exchange  - ~1.5x slower than Fld-Fstp
Fld-Fstp based Exchange - 1.0x
Shuffle based Exchange  - ~2.5x slower than Fld-Fstp
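
Note that the shuffle test only covers two values that sit next to each other in memory. Below is a hedged sketch of two SSE2 variants that were not part of the measurements above: one for two separate variables using scalar 64-bit moves, and one for an adjacent pair that uses unaligned load and store, so it does not depend on the alignment of the array. The function names SwapDoubleSse2 and SwapAdjacentPairSse2 are hypothetical. I have no timings for them, so whether either one beats Fld-Fstp would have to be tested.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical helper: exchanges two separate doubles with scalar
// 64-bit SSE2 moves. _mm_load_sd and _mm_store_sd have no alignment
// requirement beyond that of a normal double.
inline void SwapDoubleSse2(double *pdValueA, double *pdValueB)
{
    __m128d ddA = _mm_load_sd(pdValueA);  // low lane = *pdValueA
    __m128d ddB = _mm_load_sd(pdValueB);  // low lane = *pdValueB

    _mm_store_sd(pdValueA, ddB);          // *pdValueA = old *pdValueB
    _mm_store_sd(pdValueB, ddA);          // *pdValueB = old *pdValueA
}

// Hypothetical helper: exchanges pdPair[0] and pdPair[1] with one
// 128-bit load, one shuffle and one 128-bit store. The unaligned
// forms are used so that pdPair may have any alignment.
inline void SwapAdjacentPairSse2(double *pdPair)
{
    __m128d ddV = _mm_loadu_pd(pdPair);   // { pdPair[0], pdPair[1] }
    ddV = _mm_shuffle_pd(ddV, ddV, 1);    // { pdPair[1], pdPair[0] }
    _mm_storeu_pd(pdPair, ddV);
}
```

In a sorting routine the first variant is the one that applies in general, because the two elements to exchange are usually not adjacent.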
