Beginner
271 Views

## Performance implications of casting float to double?

Hello,
What are the performance implications of casting floats to doubles? We have a tight inner loop where most of our work is done, and in that loop we accumulate results. Our accumulation variable was of type float, but out of concern that we might somehow overflow or lose too much precision, we switched it to double and cast each floating-point result to double before adding it to the accumulator. To my amazement, the execution time tripled. Is there a faster way to do this? Thanks in advance.
9 Replies
Black Belt
Without knowing the range of numbers being accumulated, it is difficult to provide a "best" solution.

Assume the tight loop is accumulating results from 1,000,000 iterations.
Assume your input data is of format n.nnn0000
After 1,000 iterations your accumulator might contain n,nnn.nnn (7 digits of data)
If you continue to add into this accumulator you will, through round-off, be dropping the last significant digit(s) in the source data. However, if you were to:

on each anniversary of 1,000 iterations (and after the last iteration)
{
convert accumulator to double and sum into grand total accumulator
then zero out low precision accumulator and resume accumulation
}

Then an almost equivalent double-precision result will be produced at very low additional overhead.

The appropriate frequency of the grand-total step depends on the magnitude of the values in the input data.

Jim Dempsey
Beginner
That's a very clever trick, I hadn't thought of that! Thank you, I'll try it.
Black Belt
Jim proposes an interesting solution to the dilemma, which should avoid significant performance loss.
icc vectorized float accumulation already uses 4 independent partial sums, so there is a little more protection than in a single float sum. An OpenMP sum reduction would introduce additional independent partial sums.

Double accumulation of results from float operations might be more efficient in x87 code than in scalar SSE code, but x87 rules out vectorization. Various tactics might be used to get x87 code, possibly including defining the sum as long double, casting operands to long double, and setting the corresponding compile options. I don't recommend this if Jim's idea works.
Employee
If your code is not vectorized, then the float computation is done using double internally by the hardware. You may consider using double instead of float; this will not degrade performance. It may increase the memory requirement and program size, though.

Black Belt
>>icc vectorized float accumulation already uses 4 independent partial sums, so there is a little more protection than in a single float sum. An OpenMP sum reduction would introduce additional independent partial sums.

This holds true when the sum-total produced is a float.

When the sum-total is to be a double (from floats), the trick is how to reduce the number of float-to-double conversions (while keeping SSE on floats) while eventually producing a double sum-total (and keeping precision). The technique I outlined earlier trades off performance against precision in a tunable manner (tunable via the frequency at which the grand totaling is performed).

The vector sum of floats into doubles might be a useful extension to AVX;
IOW, a small vector of floats from memory accumulated into 2 xmm registers as doubles.

How useful this would be, I am in no position to tell. Although RAM prices are falling and you could change your stored data from float to double at little cost (\$), you still have the memory bandwidth issue: floats fetch twice as fast as doubles. Speed matters.

Jim Dempsey
Black Belt
Here is a variant of Jim Dempsey's idea, which avoids issues of SSE/MMX/8087, etc.

In place of his double-precision grand accumulator, use a poor-man's integer counter for float overflow.

Use a float threshold value, say 1e7.

After a number of iterations of your algorithm, say every 100 iterations, check whether your float accumulator contains a value greater than THRESHOLD. If so, increment the poor-man's counter and subtract THRESHOLD from the float accumulator.

At the end, compute the grand accumulator value as (poor-man's counter contents) × THRESHOLD + (float accumulator contents).
Moderator

How is the performance with the modified code?

There might be something the compiler is doing badly when you cast the float to double originally. Is it possible for you to send me a test case or code snippet? (Use a private message if you prefer.)

I'm just hoping that we could improve the compiler so it might benefit all.

thanks,
Jennifer

Black Belt
>>After a number of iterations of your algorithm, say every 100 iterations, check whether your float accumulator contains a value greater than THRESHOLD. If so, increment the poor-man's counter and subtract THRESHOLD from the float accumulator.

Why do the check?

It is faster (and just as accurate) to extract the integer portion of the float, add it to the higher-precision accumulator, and subtract it from the running low-precision accumulator. This assumes the end result is float, or at least that the end result has fewer than 7/8 digits following the ".".

IOW, adding 0.0 is faster than testing for it and branching around.

This should be codeable entirely in SSE using all 4 packed floats.
You would have better precision in the lsb, but the end results are still SP, or in DP if using

double result = (double)high + (double)low;

This will give you up to 7 digits to left of "." and 6/7 digits to right of "."

I am not an expert on SSE (or the intrinsics). I did not see a single instruction to convert 4 SP floats to 4 integer-valued SP floats (it may be there). You can convert 4 SP floats to 4 int32s, but the document I have does not show the conversion the other way (because they do not know how to handle potential overflow).

An alternate way to get a 4-wide "int" function for SP floating point (result: 4 floats with no fraction), assuming all positive numbers:

add a 4-wide literal (kept in a register) with each element containing 2^24,
then immediately subtract the same literal.

The first add will flush out any fraction bits; the subtract will remove the 2^24.
(Use 2^53 for the DP hack.)

*** caution ***

The above will work provided the internal architecture of the SSE does not change to maintain residual round-off bits (similar to the FPU instructions). When (if) that happens, you may need to add an additional move (register to register). I do not foresee this change happening.

Jim Dempsey
Black Belt
I forgot to mention that although I did not see the 4x int32 to 4x float conversion, if the int numbers are well behaved (positive and less than 2^24) then the 4x int32 values, when used as 4x floats, will appear as denormalized numbers. The SSE should handle these and convert to SP float automagically. Negative int32 numbers will require a negate and an OR with the sign bit (with the restriction still that the negated value be less than 2^24).

Jim Dempsey