lkeene

Beginner

05-20-2010
11:35 AM

Performance implications of casting float to double?

Hello,

What are the performance implications of casting floats to doubles? We have a tight inner loop in which most of our work is done, and in this loop we accumulate results. Right now our accumulation variable is of type float, but out of fear that we might overflow or lose too much precision, we decided to switch the accumulation variable to double and cast the floating-point results to double before adding them to it. To my amazement, the execution time tripled. Is there a faster way to do this? Thanks in advance.
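
The original loop was not posted, so the following is only a guess at its shape: a minimal C++ sketch contrasting a float accumulator with a double accumulator that casts each float before adding. The function names and the `data` array are hypothetical stand-ins for whatever the inner loop actually computes.

```cpp
#include <cassert>
#include <cstddef>

// Accumulate in float: no conversions are needed, so a compiler is free to
// keep the whole loop in packed single precision.
float sum_as_float(const float* data, std::size_t n) {
    assert(data != nullptr || n == 0);
    float acc = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        acc += data[i];
    }
    return acc;
}

// Accumulate in double: every element is converted float -> double
// before it is added.
double sum_as_double(const float* data, std::size_t n) {
    assert(data != nullptr || n == 0);
    double acc = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        acc += static_cast<double>(data[i]);
    }
    return acc;
}
```

For small, exactly representable inputs both functions return the same value; the difference in accuracy only shows up as the running total grows.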

9 Replies

Without knowing the range of numbers being accumulated, it is difficult to provide a "best" solution.

Assume the tight loop is accumulating results from 1,000,000 iterations.

Assume your input data is of format n.nnn0000

After 1,000 iterations your accumulator might contain n,nnn.nnn (7 digits of data)

If you continue to add into this accumulator you will, through round-off, be dropping the last significant digit(s) in the source data. However, if you were to:

on each multiple of 1000 iterations (and after the last iteration)

{

convert the accumulator to double and add it into a grand total accumulator

then zero out the low-precision accumulator and resume accumulation

}

then an almost equivalent double-precision result will be produced at very low additional overhead.

The frequency of the grand total would depend on the values in the input data.
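
A minimal sketch of the scheme described above, assuming the inputs sit in a contiguous `float` array. The name `blocked_sum` and the default `block_size` of 1000 are illustrative choices, not part of the original suggestion; the block size is the tuning knob.

```cpp
#include <cassert>
#include <cstddef>

// Sum n floats: accumulate in float within each block, then fold the block
// total into a double grand total and restart the float accumulator.
double blocked_sum(const float* data, std::size_t n,
                   std::size_t block_size = 1000) {
    assert(block_size > 0);
    double grand_total = 0.0;
    for (std::size_t start = 0; start < n; start += block_size) {
        // The last block may be shorter than block_size.
        const std::size_t end =
            (n - start < block_size) ? n : start + block_size;
        float partial = 0.0f;  // low-precision accumulator, reset per block
        for (std::size_t i = start; i < end; ++i) {
            partial += data[i];  // stays in float inside the hot loop
        }
        grand_total += static_cast<double>(partial);  // one conversion per block
    }
    return grand_total;
}
```

A call such as `blocked_sum(values, count, 1000)` performs one float-to-double conversion per 1000 elements instead of one per element.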

Jim Dempsey

jimdempseyatthecove

Black Belt

05-20-2010
12:37 PM

lkeene

Beginner

05-20-2010
02:31 PM

That's a very clever trick, I hadn't thought of that! Thank you, I'll try it.

Jim proposes an interesting solution to the dilemma, which should avoid significant performance loss.

icc vectorized float accumulation already uses 4 independent partial sums, so there is a little more protection than in a single float sum. An openmp sum reduction would introduce additional independent partial sums.
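
To make the "4 independent partial sums" concrete, here is a scalar model of what a 4-wide vectorized reduction does. It is an illustration only, not the code icc generates, and `sum_four_lanes` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>

// Scalar model of a 4-wide reduction: element i is added to lane i % 4,
// and the four lanes are combined only once, at the end.
float sum_four_lanes(const float* data, std::size_t n) {
    assert(data != nullptr || n == 0);
    float lane[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        lane[0] += data[i];
        lane[1] += data[i + 1];
        lane[2] += data[i + 2];
        lane[3] += data[i + 3];
    }
    float tail = 0.0f;  // leftover elements when n is not a multiple of 4
    for (; i < n; ++i) {
        tail += data[i];
    }
    return (lane[0] + lane[1]) + (lane[2] + lane[3]) + tail;
}
```

Each lane sees only a quarter of the elements, so each partial sum stays smaller than a single running total would.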

Double accumulation of results from float operations might be more efficient in x87 code than in scalar SSE code, but x87 rules out vectorization. Various tactics might be used to get x87 code, possibly including defining the sum as long double, casting operands to long double, and setting corresponding compile options. I don't recommend this, if Jim's idea works.

TimP

Black Belt

05-20-2010
03:19 PM

If your code is not vectorised, then the float computation is done using double internally by the hardware. You may consider using double instead of float. This will not degrade performance. It may increase the memory requirement and program size, though.

Om_S_Intel

Employee

05-21-2010
01:38 AM

>>icc vectorized float accumulation already uses 4 independent partial sums, so there is a little more protection than in a single float sum. An openmp sum reduction would introduce additional independent partial sums.

This holds true when the sum-total produced is a float.

When the sum-total is to be a double (from floats), then the trick is how to reduce the number of float-to-double conversions (while keeping SSE on floats) and eventually produce a double as the sum-total (while keeping precision). The technique I outlined earlier trades off performance against precision in a tunable manner (tunable via the frequency at which the grand totaling is performed).

The vector sum of floats to doubles might be a useful extension to AVX.

IOW, a small vector of floats from memory accumulated into 2 xmm registers as doubles.

How useful this would be, I am in no position to tell. Although RAM prices are falling and you could change your stored data from float to double at little cost ($), you still have the memory bandwidth issue, in that the floats will fetch twice as fast as the doubles. Speed matters.

Jim Dempsey

jimdempseyatthecove

Black Belt

05-21-2010
05:54 AM

Here is a variant of Jim Dempsey's idea, which avoids issues of SSE/MMX/8087, etc.

In place of his double-precision grand accumulator, use a poor-man's integer counter for float overflow.

Use a float threshold value, say 1e7.

After a number of iterations of your algorithm, say, every 100 iterations, check if your float accumulator contains a value greater than THRESHOLD. If so, increment the poor-man's counter, and subtract THRESHOLD from the float accumulator.

At the end, find the grand accumulator value as (poor-man's counter contents) × THRESHOLD + (float accumulator contents).
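
The steps above can be sketched roughly as follows. The function name, parameter names, and defaults (`1e7f`, every 100 iterations) are assumptions taken from the example figures in this post; the final combination is done in double here only so that the counter term is not rounded again.

```cpp
#include <cassert>
#include <cstddef>

// Sum floats with an integer "overflow" counter: whenever the periodic check
// finds the float accumulator above the threshold, move one threshold's worth
// out of the accumulator and into the counter.
double sum_with_counter(const float* data, std::size_t n,
                        float threshold = 1e7f,
                        std::size_t check_every = 100) {
    assert(threshold > 0.0f && check_every > 0);
    float acc = 0.0f;
    long long count = 0;  // number of thresholds removed so far
    for (std::size_t i = 0; i < n; ++i) {
        acc += data[i];
        if ((i + 1) % check_every == 0 && acc > threshold) {
            ++count;
            acc -= threshold;
        }
    }
    // Grand total = counter * THRESHOLD + what is left in the accumulator.
    return static_cast<double>(count) * static_cast<double>(threshold) +
           static_cast<double>(acc);
}
```

Note that a float near 1e7 already has a spacing of 1 between adjacent values, so with that threshold this sketch mainly bounds the accumulator's magnitude; a smaller threshold would be needed to protect fractional digits.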

mecej4

Black Belt

05-26-2010
11:57 AM

JenniferJ

Moderator

05-26-2010
02:29 PM

There might be something the compiler is doing badly when you cast the float to double originally. Is it possible for you to send me a test case or code snippet? (Send it privately if you prefer.)

I'm just hoping that we can improve the compiler so that it might benefit all.

thanks,

Jennifer

>>After a number of iterations of your algorithm, say, every 100 iterations, check if your float accumulator contains a value greater than THRESHOLD. If so, increment the poor-man's counter, and subtract THRESHOLD from the float accumulator.

Why do the check?

It is faster (and just as accurate) to extract the integer portion of the float, add it to a higher-precision accumulator, and subtract it from the running accumulator. This assumes the end result is float, or at least that the end result has fewer than 7/8 digits following ".".

IOW, adding 0.0 is faster than testing for it and branching around the add.

This should be code-able all in SSE using all 4 packed floats.

You would have better precision in the LSB, but the end results are still SP, or DP if using

double result = (double)high + (double)low;

This will give you up to 7 digits to left of "." and 6/7 digits to right of "."
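
A scalar sketch of the high/low split, under stated assumptions: `std::floor` stands in for the integer-extraction step (the post does not name an instruction), and the function name and the `every` parameter are hypothetical. The `high` accumulator holds only whole numbers, which a float represents exactly up to 2^24.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Keep whole numbers in `high` and the fractional remainder in `low`,
// transferring the integer portion every `every` iterations without
// testing its value first.
double sum_high_low(const float* data, std::size_t n,
                    std::size_t every = 100) {
    assert(every > 0);
    float high = 0.0f;  // whole-number part, exact while below 2^24
    float low = 0.0f;   // running accumulator, kept small
    for (std::size_t i = 0; i < n; ++i) {
        low += data[i];
        if ((i + 1) % every == 0) {
            const float whole = std::floor(low);  // integer portion, still a float
            high += whole;  // adds 0.0 when low < 1: no data-dependent branch
            low -= whole;   // leaves a remainder in [0, 1)
        }
    }
    return static_cast<double>(high) + static_cast<double>(low);
}
```

Because `low` never grows far beyond one block's worth of input, the low-order digits of each addend are less likely to be rounded away.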

I am not an expert on SSE (or intrinsics). I did not see an instruction that converts 4 SP floats to 4 SP float integers (it may be there). You can convert 4 SP floats to 4 int32s, but the document I have does not show a conversion the other way (because they do not know how to handle potential overflow).

An alternate way to get a 4-wide int function for SP floating point (result: 4 floats with no fraction)

(assuming all positive numbers below 2^23)

add a 4-wide literal (kept in a register) with each lane containing 2^23 (an SP float has a 24-bit significand, so values in [2^23, 2^24) are spaced exactly 1 apart; with 2^24 the spacing is 2 and only even integers survive),

then immediately subtract the same literal.

The add will round away any fraction bits (to the nearest integer, not by truncation, under the default rounding mode); the subtract will remove the 2^23.

(use 2^52 for the DP hack)

*** caution ***

The above will work provided the internal architecture of the SSE does not change to maintain residual round-off bits (similar to FPU instructions).

When (if) that happens, you may need to add an additional move (register to register). I do not foresee this change happening.
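
A scalar sketch of the add-then-subtract trick, offered as an illustration rather than a tuned SSE routine. It uses 2^23 as the constant, because floats in [2^23, 2^24) are spaced exactly 1 apart, and it assumes a non-negative input below 2^23 and the default round-to-nearest mode. The `volatile` is an assumption-driven safeguard: it forces the intermediate to be stored as a float and keeps an optimizer with a relaxed floating-point model from simplifying `(x + c) - c` back to `x`.

```cpp
#include <cassert>

// Round a float to the nearest whole number by adding and then subtracting
// 2^23. Valid for 0 <= x < 2^23 in round-to-nearest mode.
inline float round_via_magic(float x) {
    assert(x >= 0.0f && x < 8388608.0f);
    const float kMagic = 8388608.0f;      // 2^23
    volatile float shifted = x + kMagic;  // fraction bits are rounded away here
    return shifted - kMagic;              // exact: removes the 2^23 again
}
```

The result is rounded to nearest with ties going to the even integer, so `round_via_magic(2.5f)` yields `2.0f`, not `3.0f`; this differs from truncation.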

Jim Dempsey

jimdempseyatthecove

Black Belt

05-27-2010
09:30 AM

I forgot to mention that although I did not see the 4x int32 to 4x float conversion, if the int numbers are well behaved (positive and less than 2^24), then the 4x int32, when used as 4x floats, will appear as denormalized numbers. The SSE should handle these and convert to SP float automagically. Negative int32 numbers will require a negate and an OR with the sign bit (still with the restriction that the negated value be less than 2^24).

Jim Dempsey

jimdempseyatthecove

Black Belt

05-27-2010
09:40 AM
