Intel® Integrated Performance Primitives
Community support and discussions relating to developing high-performance vision, signal, security, and storage applications.
This community is designed for sharing of public information. Please do not share Intel or third-party confidential information here.
6629 Discussions

Alarming floating point discrepancies between simple IPP and MSVC++ operations..



I have had this problem now for approximately 4 days and I am at a total loss to explain why I am seeing it. Here is the context:

I have a simple VC++ program using the IPP libraries, where Ihave an Ipp32f vector of length 4, (call it 'aVector')and divide it by an Ipp32f constant, (call is 'aConstant')using ippsDivC_32f. Also the first element of the vector is equal to the constant. So, aVector[0] is equal to aConstant. I store the result in another Ipp32f vector of length 4, and lets call that 'theResult'.

Here is the problem: VERY SURPRISINGLY, my answer for theResult[0] is 0.9999999 !!! Instead of 1.000000!!, when I use the ippsDivC_32f function.

I then went ahead and commented out ippsDivC_32f, and decided to do the division using a good old fashioned for-loop, that would simply divide each element of aVector by aConstant individually, and in this case, I got the correct answer for theResult[0] which is 1.0000000.

I also got the correct answer of 1.000000 when I did the same calculation in MATLAB (forcing it to do single precision of course).

What is the cause of this discrepancy? Why is ippsDivC_32f giving me 0.9999999 instead of 1.0000000? I have scoured and net and everyone and his brother seems to have a different answer or non-answer, it truly seems like the twilight zone. I have read tidbits about how some calculations are done using 80 (!?) bits in the FPU, and how this compiler does this and another compiler does that, but so far I am not clear on what is truly going on.

Any help or insight would be truly appreciated, I am writing code for a reccursive scientific program, which is why this is important.


0 Kudos
6 Replies

I also want to add that the discrepancy isnt limited to such a small value, I have also tested cases on different functions such as ippsSum_32f, where the difference between its result and a summation in a for loop is as high as 1e-4.

In other words, MATLAB in single precision and native C++ code will give *exactly* matching results, but when an equivalent division of addition IPP function is used, it sometimes matches, and more often does not, with errors as large as 1e-4.
New Contributor III

There do tend to be some rounding errors with F32 (float). What did you do in straight C code? Were the values F32 (float) or F64 (double)? Can you show the C code?



Hi Peter,

I will post the exact code for you to look attonight when I get home.

In the meantime, let me try to give you more information - I have a somewhat big project but I can simplify it here.

1)I have a very simpleVC++ function, lets call it 'myFunction', that sits in a DLL that I made, lets call it 'myDLL'.
2) In MATLAB I generated a double precision vector, lets call is 'dblVector', but I typecast it, and save it as a single precision vector, lets call it 'singleVector'. It has length 100.
3) Now in MATLAB, I use 'loadlibrary' toloadmyDLL.
4) Next, I pass in'singleVector' into the 'myFunction' which is called from myDLL. Now, inside my VC++ function, a vector called 'singleVectorInput' has the same exact data as 'singleVector' in MATLAB. (singleVectorInput was made like so: Ipp32f* singleVectorInput = ippsMalloc_32f(100). )

So far, so good, and there are no problems.

5) Now, in the VC++ program, I do some calculations in singleVectorInput. There is matrix multiply, squaring, and summations, but ALL of them are ALWAYS in 32f. There is NO double precision anything anywhere. (EVERYTHING in my VC++ program is of type Ipp32f - no doubles anywhere).
6) Now, after all that, I am left with a vector of length 4, lets call it 'aVector', and a scalar, lets call it 'aScalar'. I then do ippsDivC_32f, storing the result in 'aResult', which is also of type Ipp32f. I pass theResult back into MATLAB so that I can look at it. theResult[0] is now 0.9999999.

Now, if I completely ignore step 6, and instead, write a for-loop that did this:
for (int ii= 0; ii < 4; ii++)
theResult[ii] = aVector[ii] / aScalar;

, Again, I pass theResult back into MATLAB to look at it,and theResult[0],now shows 1.000000.

This is what I cannot answer thats making me pull my hair out!

The problem is that myFunction is actually a recurssive algorithm, so there is error accumulation over many many loops, and in the end, when I compare MATLAB or native C code that does the same thing against the IPP code, the answers are WILDLY divergent, and the IPP result is wrong.

Thanks in advance Peter, and let me know if you need me to elaborate on any part(s)...


The code looks to be using the single precision as computation. The faction part will have 23 bit:

so, it typically have to have 6 to 7 significant digits (1e-6 to 1e-7 for normalized data)

As peter suggest, this may cause by the rounding error, which is expected in float point computation. For example, it the code keeps only 6 significant digits.

1e3 *( 1e-3 - 1e-3/1e4) = 1.0, (1e-3 - 1e-3/1e-4) = 1.0000001, rounds to 1.00

while its value should be: 0.9999

The following page has more discussion on this topic:

For the difference in ippsDivC_32f, besides the code, could you please also attach the input data? so the function owner can have further check if this result is fine.


"Normal" C++ code - even if YOU choose to use 32-bit floats - is usually calculated internally (in the CPU, or should I still say FPU?) with 80 bits.

Instead, SSEx divides, which I suppose IPP uses, are done internally using 32 bits!

So using SSEx divides (or other float operations), you can expect bigger rounding errors. If you don't want that, you cannot really use SSEx and you'll have to revert to (much slower) non-vectorized instructions.

Note that float rounding errors ALWAYS occur. Try the following:

float sum = 0;
for (int c=0; c<1000; c++)
sum += 1.0f;
printf("%f\n", sum);

Chances are that you'll find at least some values that end at .0000001 or .9999999 (or worse). Using doubles of course reduces the size of the mismatch - but it will still be there.

Also, on your 1e-4 rounding errors: How big are the input values? I would be very surprised if you get such errors when adding 1.0f to 1.0f - but for example 1000000000.0f - 1.0f might result in 1000000000.0f...

If you need the precision, either use doubles or don't use SSEx at all. If you don't really need this much precision, just ignore the small rounding errors.
New Contributor III
If you need the precision (and it sounds like you do from the recursive nature of your problem), then you might want to switch to double and ipps_DivC_64f. You'll pay a speed penalty, but it may be better than using f32 and straight C code.