
Hello Sir,

I have a precision issue with the code below. If I do the calculation for the same input on my calculator, I get -13421772.8, whereas the compiled code gives -13421773.0. This is a considerable difference for us.

The variable used for the above observation is ‘tmp’.

Please help us resolve this.

Thanks in advance.

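    #include <smmintrin.h>  // SSE4.1 header; needed for _mm_insert_ps

    void convert(__m128 &vrz /*inout*/, int art)
    {
        // Save the current rounding mode and switch to round-toward-zero.
        const unsigned int _rounding_mode = _MM_GET_ROUNDING_MODE();
        _MM_SET_ROUNDING_MODE(_MM_ROUND_TOWARD_ZERO);

        const float scale = (float)((unsigned int)1 << (31 - art));
        const __m128 scale_vr = _mm_set1_ps(scale);
        const __m128 tmp = _mm_mul_ps(vrz, scale_vr);

        // Convert tmp to 32-bit integers, then copy element 1 of that result
        // into element 1 of vrz (imm8: source index 1, destination index 1).
        vrz = _mm_insert_ps(vrz, _mm_castsi128_ps(_mm_cvtps_epi32(tmp)), (1 << 6) | (1 << 4));

        // Restore the caller's rounding mode.
        _MM_SET_ROUNDING_MODE(_rounding_mode);
    }

    int main()
    {
        float a = (float)-0.8;
        __m128 vrz = _mm_set1_ps(a);
        convert(vrz, 7);
        return 0;
    }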

Thanks,

Eswar Reddy K


This is **not** related to any C++ compiler or command-line options. It is related to the **limitations of single-precision** arithmetic. To improve the precision of your calculations, you need to change to **double-precision** arithmetic. Try these simple tests:

- 16777216.0f + 1.0f = 16777216.0f - !!! - It is **not** 16777217.0, due to the limitations of single-precision arithmetic
- 16777216.0f + 2.0f = 16777218.0f
- 16777216.0f + 3.0f = 16777220.0f - !!! - It is **not** 16777219.0, due to the limitations of single-precision arithmetic
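The three sums above can be checked with a few lines of C++. This is only a minimal sketch and not part of the original post: the helper names `add_f32` and `print_sums` are made up for illustration, and it assumes `float` is IEEE 754 single precision with the default round-to-nearest mode.

```cpp
#include <cstdio>

// Hypothetical helper: adds two floats and forces the result to be stored as a
// 32-bit float, so no wider intermediate precision leaks into the result.
float add_f32(float a, float b)
{
    volatile float sum = a + b;
    return sum;
}

// Prints the three sums from the post. 16777216 is 2^24; from there up to 2^25
// adjacent floats are 2 apart, so an odd exact sum has to be rounded to a
// representable neighbour.
void print_sums()
{
    const float base = 16777216.0f;
    for (float addend = 1.0f; addend <= 3.0f; addend += 1.0f)
        std::printf("%.1f + %.1f = %.1f\n", base, addend, add_f32(base, addend));
}
```

Calling `print_sums()` from `main` prints one line per sum, which can be compared with the values listed above.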


Sorry, display problem. Below are my compiler options:

- WarningLevel: Level3
- Optimization: Disabled
- UseProcessorExtensions: AVX2
- BasicRuntimeChecks: Default
- AdditionalOptions: /fp:precise
- FlushDenormalResultsToZero: false
- FloatingPointModel: Precise
- FloatingPointExpressionEvaluation: Default


Eswar,

The issue here may be:

float a = (float)-0.8;

Here the value of a is not produced with the same rounding mode (round toward zero) that your code sets at run time. As a quick test, compile a Debug build. After a has been assigned, open a Memory window and examine "&a", viewed as unsigned 1-byte integers. You should see "205 204 76 191". Subtract 1 from the 205 to undo the round-up. Had that byte been 0, then 0 - 1 would produce 255 with a borrow propagating to the next byte (i.e. subtract 1 from the next byte as well). There will be some cases where the exponent also needs to be adjusted, but that is not necessary for this experiment.

Once the value of a has been adjusted, continue execution and check the result.

For a formal fix, you will have to be careful about how you preset parameters that contain fractional values that cannot be represented exactly in binary. 0.1 is one such fraction, as is 0.8.

Jim Dempsey
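The Memory-window experiment can also be tried in code. The sketch below is an illustration and not Jim's exact procedure: `bits_of`, `step_toward_zero` and `scale_f32` are hypothetical helper names, and it assumes a 32-bit IEEE 754 `float`. It uses `std::nextafter` to move to the adjacent float closer to zero, which has the same effect as subtracting 1 from the low byte when no borrow is needed.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Returns the raw 32-bit pattern of a float (hypothetical helper).
std::uint32_t bits_of(float value)
{
    static_assert(sizeof(float) == sizeof(std::uint32_t), "assumes 32-bit float");
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof bits);
    return bits;
}

// Returns the neighbouring float one step closer to zero (hypothetical helper).
float step_toward_zero(float value)
{
    return std::nextafter(value, 0.0f);
}

// Scales by 2^(31 - art) in plain scalar float arithmetic, mirroring the
// scale factor computed in convert().
float scale_f32(float value, int art)
{
    return value * static_cast<float>(1u << (31 - art));
}
```

With `a = (float)-0.8`, `bits_of(a)` is 0xBF4CCCCD, which a little-endian Memory window shows as the bytes 205 204 76 191. `scale_f32(a, 7)` gives -13421773.0, while `scale_f32(step_toward_zero(a), 7)` gives -13421772.0.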


**_mm_mul_ps** (actually, the **MULPS** instruction) rounds the results (!). I've created my own test case and debugged it. Here are some details:

Note: 16777216 = 2^24

**Correct Result ( True )**: 16777216 * 0.8 = **13421772.8** - everything is correct / **_mm_mul_ps** is not used

**Incorrect Result**: 16777216 * 0.8 = **13421773.0** - something is wrong / **_mm_mul_ps** is used / rounding is done by the **MULPS** instruction

I will spend some additional time on this this week. However, I would consider a workaround, since I really do not expect Intel to release a microcode patch for the **MULPS** instruction unless we understand what is wrong.


**[ Debug ]**

**Test-Case 1** ( 16777216 * -0.8 )

- Expected Values : -13421772.800000 -13421772.800000 -13421772.800000 -13421772.800000
- Calculated Values: -13421773.000000 -13421773.000000 -13421773.000000 -13421773.000000

**Test-Case 2** ( 16777216 * 0.8 )

- Expected Values : 13421772.800000 13421772.800000 13421772.800000 13421772.800000
- Calculated Values: 13421773.000000 13421773.000000 13421773.000000 13421773.000000

...

**[ Release ]**

**Test-Case 1** ( 16777216 * -0.8 )

- Expected Values : -13421772.800000 -13421772.800000 -13421772.800000 -13421772.800000
- Calculated Values: -13421773.000000 -13421773.000000 -13421773.000000 -13421773.000000

**Test-Case 2** ( 16777216 * 0.8 )

- Expected Values : 13421772.800000 13421772.800000 13421772.800000 13421772.800000
- Calculated Values: 13421773.000000 13421773.000000 13421773.000000 13421773.000000

...

**Note**: The intrinsic function **_mm_mul_ps** is used for the **Calculated Values**.
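For comparison, the same scaling can be done in plain scalar arithmetic, without `_mm_mul_ps`. This is a minimal sketch assuming IEEE 754 `float` and `double`; all function names are hypothetical. The scale factor 16777216 is 2^24, and multiplying by a power of two only changes the exponent, so as far as I can tell the single-precision product simply reproduces the value the float input already holds.

```cpp
#include <cstdint>

// Scale factor used by convert(): 2^(31 - art), e.g. 16777216 for art = 7.
double scale_for(int art)
{
    return static_cast<double>(std::uint32_t{1} << (31 - art));
}

// Single-precision path: the input is rounded to float before scaling.
float scaled_f32(double x, int art)
{
    return static_cast<float>(x) * static_cast<float>(scale_for(art));
}

// Double-precision path: the input keeps its double value.
double scaled_f64(double x, int art)
{
    return x * scale_for(art);
}

// Truncates toward zero, like a float-to-int conversion performed under
// _MM_ROUND_TOWARD_ZERO.
std::int32_t truncated(double scaled)
{
    return static_cast<std::int32_t>(scaled);
}
```

For x = -0.8 and art = 7, `scaled_f32` returns -13421773.0 and `scaled_f64` returns approximately -13421772.8; after truncation the two paths give -13421773 and -13421772 respectively.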


Thanks Sergey & Jim !

I have observed the same behaviour irrespective of the configuration.


**Test-Case 5**

- Expected Values : 13421772.800000 13421772.800000 13421772.800000 13421772.800000
- Calculated Values: -13421772.800000 -13421772.800000 -13421772.800000 -13421772.800000

**Test-Case 6**

- Expected Values : 13421772.800000 13421772.800000 13421772.800000 13421772.800000
- Calculated Values: 13421772.800000 13421772.800000 13421772.800000 13421772.800000

...


Sergey Kostrov,

The results look OK for test cases 5 and 6.

Could you please provide the compiler options, and any other settings, used for test cases 5 and 6?

Thanks,

Eswar Reddy K




Actually, the rounding is probably done via a micro-operation control signal (mulps is decoded into the corresponding uop). It would be interesting to know what triggers the rounding mode in the SIMD FPU (perhaps some control bit that is set when mulps is decoded).


Thank you!


The **IEEE 754 Standard** describes all of this; take a look at it. The most accurate single-precision representation of 13421772.8 is 13421773.0. In binary form both numbers look the same:

13421772.8 = 13421773.0 = 0x4B4CCCCD = 0 10010110 10011001100110011001101

**Note 1**: The first digit is the sign ( 0 is for positive ), followed by the exponent, followed by the mantissa.

**Note 2**: Use a debugger to verify it.
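Besides a debugger, the three fields can be extracted in code. A minimal sketch, assuming a 32-bit IEEE 754 `float`; the `Float32Fields` struct and the `decompose` function are hypothetical names introduced only for this illustration.

```cpp
#include <cstdint>
#include <cstring>

// The three fields of an IEEE 754 single-precision value.
struct Float32Fields {
    std::uint32_t sign;      // 1 bit
    std::uint32_t exponent;  // 8 bits, biased by 127
    std::uint32_t mantissa;  // 23 bits, without the implicit leading 1
};

// Splits a float into its sign, biased exponent and stored mantissa bits.
Float32Fields decompose(float value)
{
    static_assert(sizeof(float) == sizeof(std::uint32_t), "assumes 32-bit float");
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof bits);
    return Float32Fields{bits >> 31, (bits >> 23) & 0xFFu, bits & 0x7FFFFFu};
}
```

Both `decompose(13421772.8f)` and `decompose(13421773.0f)` yield sign 0, exponent 150 (binary 10010110) and mantissa 0x4CCCCD (binary 10011001100110011001101), matching the bit pattern 0x4B4CCCCD.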


**Support of 'long double' floating point data type on Intel CPUs ( A collection of threads )** Web-link: http://software.intel.com/en-us/node/375459


Thanks for the insight.
