SergeyKostrov
Valued Contributor II

Fast method to exchange values of two double-precision variables

Let's say there are two double-precision variables declared as follows:

...
double dValueA = 55.55L;
double dValueB = 77.77L;
...

What is the fastest method in assembler to exchange the values of these two double-precision variables?

Best regards,
Sergey
5 Replies
Patrick_F_Intel1
Employee

Hello Sergey,
Are you doing x87 math or SSE2 math?
Is this showing up as a bottleneck?
Pat
SergeyKostrov
Valued Contributor II

Hi Patrick,

Quoting Patrick Fay (Intel)
...
Are you doing x87 math or SSE2 math?

[SergeyK] x87 - yes (this is because the solution has to be highly portable).
An SSE2 solution could also be considered.

Is this showing up as a bottleneck?

[SergeyK] Yes, and I need to make the exchange as fast as possible.

...


I'd like to provide some technical details. I don't need this to do the math, but I need to use it in several sorting algorithms, like MergeSort, QuickSort, etc.,
in cases when 'double' data types are used.

In a generic form it looks like:

[cpp]
...
dTemp = dValueA;
dValueA = dValueB;
dValueB = dTemp;
...
[/cpp]


Here is a solution I currently implemented:

[cpp]
#define HrtXchgDATATYPE_RTDOUBLE( dValueA, dValueB ) \
{ \
	_asm FLD  [dValueA] \
	_asm FLD  [dValueB] \
	_asm FSTP [dValueA] \
	_asm FSTP [dValueB] \
}
[/cpp]


The solution with FLD-FSTP instructions is ~1.6x faster, and it improves the performance of the sorting algorithms.

Is it possible to make the exchange faster?

Best regards,
Sergey
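
For comparison with the FLD-FSTP macro above, here is a minimal portable sketch of two alternatives: a plain std::swap, and an exchange of the raw 64-bit patterns through integer temporaries. The names SwapGeneric and SwapBits are hypothetical and not part of the original code, the sketch assumes that 'double' is 64 bits wide, and whether either variant beats FLD-FSTP on a given compiler and CPU has not been measured here.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <utility>

// Portable baseline: the compiler chooses the instructions.
inline void SwapGeneric(double& a, double& b) {
    std::swap(a, b);
}

// Hypothetical helper: exchanges the raw 64-bit patterns through
// integer temporaries, so no floating-point load/store is required.
// Assumption: double and std::uint64_t have the same size.
inline void SwapBits(double& a, double& b) {
    static_assert(sizeof(double) == sizeof(std::uint64_t),
                  "SwapBits assumes a 64-bit double");
    std::uint64_t ua, ub;
    std::memcpy(&ua, &a, sizeof ua);   // copy the bits of a
    std::memcpy(&ub, &b, sizeof ub);   // copy the bits of b
    std::memcpy(&a, &ub, sizeof ub);   // a receives the bits of b
    std::memcpy(&b, &ua, sizeof ua);   // b receives the bits of a
}
```

Both functions can be dropped into the exchange step of a sorting routine in place of the macro, for example SwapBits( dValueA, dValueB ), which makes it possible to time all variants under the same conditions.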

TimP
Black Belt

What is keeping the compiler from using portable source code to accomplish the same thing? Are you trying to include the cases of misaligned data? If not, wouldn't 128-bit parallel moves be preferable?
SergeyKostrov
Valued Contributor II

Quoting TimP (Intel)
What is keeping the compiler from using portable source code to accomplish the same thing?

[SergeyK] Nothing, but my top priority is optimization of the source code in the first place.
It means that the code must be highly optimized at the C/C++ level, sometimes with inline assembler, and
I can't rely all the time on the optimizations of a C/C++ compiler.

Are you trying to include the cases of misaligned data? If not, wouldn't 128-bit parallel moves be preferable?

[SergeyK] Could you provide more technical details with an example?


Thanks in advance.
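
One possible reading of the SSE2 suggestion, given as a hedged sketch and not necessarily what TimP had in mind: use SSE2 moves instead of x87 loads and stores. The function names SwapScalarSse2 and SwapAdjacentPair are hypothetical. The first exchanges two doubles at arbitrary addresses with 64-bit scalar moves; the second exchanges two doubles that are adjacent in memory with one 128-bit unaligned load, a shuffle, and one 128-bit unaligned store, so it does not assume 16-byte alignment. On targets without SSE2 the sketch falls back to std::swap.

```cpp
#include <cassert>
#include <utility>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>

// Exchange two doubles at arbitrary addresses using 64-bit SSE2 scalar moves.
inline void SwapScalarSse2(double* pA, double* pB) {
    const __m128d vA = _mm_load_sd(pA);   // low element = *pA
    const __m128d vB = _mm_load_sd(pB);   // low element = *pB
    _mm_store_sd(pA, vB);
    _mm_store_sd(pB, vA);
}

// Exchange pPair[0] and pPair[1] with a single 128-bit load and store.
// Unaligned forms are used, so pPair needs no special alignment.
inline void SwapAdjacentPair(double* pPair) {
    __m128d v = _mm_loadu_pd(pPair);
    v = _mm_shuffle_pd(v, v, 1);          // low <- old high, high <- old low
    _mm_storeu_pd(pPair, v);
}
#else
// Fallback for targets without SSE2 (assumption: plain swaps are acceptable there).
inline void SwapScalarSse2(double* pA, double* pB) { std::swap(*pA, *pB); }
inline void SwapAdjacentPair(double* pPair) { std::swap(pPair[0], pPair[1]); }
#endif
```

In a sorting algorithm the two elements being exchanged are usually not adjacent, so the 128-bit form mainly applies when each sorted record is itself a pair of doubles or when neighbouring elements are swapped.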

SergeyKostrov
Valued Contributor II

...
The solution with FLD-FSTP instructions is ~1.6x faster, and it improves the performance of the sorting algorithms.

Is it possible to make the exchange faster?


I've done a set of tests with 'Load-Shuffle-Store' intrinsic functions, like

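[cpp]
...
RTdouble ddA[2] = { 77.0L, 55.0L };
__m128d ddV = { 0.0L, 0.0L };

ddV = _mm_loadu_pd( &ddA[0] );          // unaligned load of both values
ddV = _mm_shuffle_pd( ddV, ddV, 1 );    // exchange the low and high elements
_mm_store_pd( &ddA[0], ddV );           // aligned store: ddA must be 16-byte aligned
...
[/cpp]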


but it is not as fast as the 'Fld-Fstp' based exchange. The final relative results of my tests are as follows:

Generic based Exchange - ~1.5x slower than Fld-Fstp
Fld-Fstp based Exchange - 1.0x
Shuffle based Exchange - ~2.5x slower than Fld-Fstp
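
Relative numbers like these depend heavily on the compiler, its optimization settings, and the CPU. A minimal sketch of a harness that could be used to repeat such a comparison is given below. It is not the test code behind the results above; TimeSwapsNs is a hypothetical name, and it simply times repeated exchanges of neighbouring array elements with whatever swap routine is passed in.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <utility>

// Hypothetical harness: applies swapFn to each neighbouring pair
// (data[0],data[1]), (data[2],data[3]), ... for `passes` passes and
// returns the elapsed wall-clock time in nanoseconds.
template <typename SwapFn>
double TimeSwapsNs(SwapFn swapFn, double* data, std::size_t count, std::size_t passes) {
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t p = 0; p < passes; ++p) {
        for (std::size_t i = 0; i + 1 < count; i += 2) {
            swapFn(data[i], data[i + 1]);
        }
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::nano>(stop - start).count();
}
```

Each exchange variant can be wrapped in a lambda or function taking two double references and timed on the same array. An optimizing compiler may simplify or remove such a loop, so the generated assembler should be inspected before the timings are trusted.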
