My floating point rounding program behaves differently on X86 and X64 targets

GTHER · ‎02-11-2018

Hello,

I don’t know where to ask this question so I post it here:

I have got a very simple windows C program, compiled with VS studio 2015, that does not behave the same when compiled for X86 or X64 targets.

The program converts the unsigned __int64 number “0x8000000000000410” to a long double and then converts it back to unsigned __int64, with rounding involved because all the bits cannot fit on the 52 bits of the ieee754 64 bit floating point number significand; I tested it on two different machines. On the X86 target, it returns 0x8000000000000800, whereas it returns 0x8000000000000000 on the X64 target (last bit differs after rounding).

The code is the following:

__________________________________________________________________________________

#include <conio.h>

#include <stdio.h>

#include <float.h>

#include <fenv.h>

int main()

{

long double d;

unsigned __int64 w1, w2;

_controlfp_s(NULL, _RC_NEAR, _MCW_RC);

w1 = 0x8000000000000410;

d = (long double)w1;

w2 = (unsigned __int64)d;

printf("Result: 0x%016I64x\n", w2);

printf("\n<press a key to exit...>");

_getch();

return 0;

}

__________________________________________________________________________________

Other precisions:

The testing is done with the “round to nearest” mode (value of MXCSR set to 0x1F80 thanks to the call to _controlfp_s(NULL, _RC_NEAR, _MCW_RC) on the X64 target)

For the “round down” and “round up” modes, the values are the same on both targets (result for “round down” is 0x8000000000000000) whereas result for “round up” is 0x8000000000000800)

However, for the “round toward zero” (chop) mode, the result differs once again, and the result is 0x8000000000000000 on the X86 target, and 0x8000000000000800 on the X64 target).

It seems to me that the correct results are those given on the X86 target, because they are closer to the original number (in the “nearest” case) and closer to zero (in the “toward zero” case).

I would like to know if this result can be reproduced, and if so, if there are settings somewhere that I missed and that could be changed so that the two targets behave the same.

(I join the .cpp file and a compile.bat file that compiles a X86 version and a X64 version of the exe).

Thanks in advance for your kind answer.

TimP · ‎02-12-2018

I think you're saying that your comparison is between a 32-bit Windows OS and X64 Windows. If so, the OS X87 mode settings for launching a .exe are different; 64-bit mode for x86 and 53-bit mode for x64 (although there have been occasional lapses in Windows releases). If you build an application with ICL /QxIA32, the compiler should insert code to initialize to 53-bit mode, unless you have set -Qpc80, which should set 64-bit mode. I'm not sure about CL, except that 32-bit CL used to default to X87 code if you didn't set /arch, while 64-bit CL and ICL have no such setting. I haven't had a 32-bit Windows since X64 became available, but CL doesn't pretend to support long double except as a synonym for double. Evidently, you can over-ride the initial settings of X87 mode, as you have done for simd modes.

I don't think there is anything approaching satisfactory support for 64-bit mode long double in X64. Even in 32-bit OS with ICL, library function calls don't handle 64-bit mode correctly. There may be such differences between a 32-bit program running on 32- and 64-bit Windows, if you don't take care of the mode settings.

Round toward 0 in 32-bit x87 code when casting float to integer is done by inserting instructions to change the rounding mode and then change it back. ICL used to have a setting to compile everything with round to 0, so as to speed up (int) casts; I don't know if it is still supported. 64-bit compilers will be using simd scalar or parallel instructions, again temporarily setting round to 0.

32-bit compilation treats __int64 as a struct of presumably signed and unsigned ints. Your cast may involve a library call, possibly the same one for CL and ICL. You could examine the compiler generated asm code and trace with debugger (if the library doesn't have guards against debug tracing).

The original versions of Windows X64 didn't permit X87 code at all. X87 was enabled primarily to permit legacy 32-bit programs to run, without taking care of bit-for-bit consistency.

GTHER · ‎02-12-2018

Hello Tim P.,

Thank’s a lot for your answer.

Maybe I was not clear in describing what I call a “target” here, but If you take the time to read the compile.bat file that I join with this message, you may see that the compilation I do to get my EXEs is quite simple:

I compile the .cpp that is given in my original message (and that I join also in the zip file) with Visual studio, with “C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\bin\cl.exe” to get a 32 bits code console exe named test_X86.exe and also with C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\bin\amd64\cl.exe to generate a 64 bits version of the exe named test_X64.exe of the exe. Then I launch the two programs. In each of them the rounding mode is set to “nearest” thanks to the call to _controlfp_s(NULL, _RC_NEAR, _MCW_RC); at the start of the program. All of this is done on two different machines, each of them running 64 bits windows 10 pro.

test_X86.exe returns 0x8000000000000800, which is the correct answer for the rounding in the nearest mode, whereas test_X64.exe returns 0x8000000000000000, which is a wrong answer as if it was rounded toward zero instead of nearest.

The 64 bits assembly code is the following on the X64 target:

w1 = 0x8000000000000410;

00007FF60891180E mov rax,8000000000000410h

00007FF608911818 mov qword ptr [w1],rax

d = (long double)w1;

00007FF7A40B181C mov rax,qword ptr [w1]

00007FF7A40B1820 cvtsi2sd xmm0,rax

00007FF7A40B1825 test rax,rax

00007FF7A40B1828 jge main+62h (07FF7A40B1832h)

00007FF7A40B182A addsd xmm0,mmword ptr [__real@43f0000000000000 (07FF7A40B9C80h)]

And I get {0x43e0000000000000, 0x0000000000000000} as a result in XMM0 (rounded toward zero)

The X86 32 bits code is the following on the 32 bits target:

w1 = 0x8000000000000410;

00C71889 mov dword ptr [w1],410h

00C71890 mov dword ptr [ebp-18h],80000000h

d = (long double)w1;

00C71897 mov edx,dword ptr [ebp-18h]

00C7189A mov ecx,dword ptr [w1]

00C7189D call __ultod3 (0C71005h)

00C718A2 movsd mmword ptr ,xmm0

And I get {0x43e0000000000001, 0x0000000000000000} as a result in XMM0 (rounded nearest, as expected)

I also added a call to _mm_getcsr() intrinsic right before that (not in the program I posted) in order to check that the value of mxcsr was positioned correctly by calling crt’s function _controlfp_s (MXCSR is set to 0x00001F80 so the rounding bits 14 and 13 are set to their correct “nearest” value of 00).

I followed your advice and debugged, and I think I found out why the problem is happening. This is probably due to the code generated by MS Visual studio 2015 trying to work with negative values before adding 2^64 to make the number positive, which looses precision on the last bit. In 32 bits mode, there are two executions of cvtsi2sd, one for the lower 32 bits dword, one for the higher 32 bits dword. This all fits on the significands and then gets added correctly. On the X64 target, the last bit is lost during the conversion from negative to positive. I don’t know if efficient X64 code as short as the one compiled here could be written that does not require more precision by using cvtsi2sd twice instead of once like this is done on the X32 code.

zalia64 · ‎02-26-2018

On Intel X86, there are two possible ways to do floating-point calculations:

The old way, using the old floating-point co-processor engine that uses extended 80-bit internal representation ,

Or using the new SSE floating-point hardware, using either 64-bit representations or 32-bit representations. ' double' enforces 64-bit representation.

The old, extended-precision engine is called using mnemonics like FMUL, FADD, FSUB, FDIV and work using a special floating-point stack. Everything - short integers, integers, long, _int64, float, doubles, long doubles - everything is converted to the extended precision format. The calculations are done using this extra accuracy. Results are converted back (and truncated ) to the storage class of the result variable.

The new engine uses registers Xmm and SSE mnemonics like 'movSD, mulSD, addSD ' for doubles. There are different mnemonics for every storage class.

Conversion from the doubles to integers could be done using both engines, using different mnemonics, with different results ...

The name 'long double' , in many C compilers, refers to the 80-bit representation. BUT NOT ALL OF THEM. Citing Wikipedia,

On the x86 architecture, most C compilers implement long double as the 80-bit extended precision type supported by x86 hardware (sometimes stored as 12 or 16 bytes to maintain data structure alignment), as specified in the C99 / C11 standards (IEC 60559 floating-point arithmetic (Annex F)). An exception is Microsoft Visual C++ for x86, which makes long double a synonym for double.^[2] The Intel C++ compiler on Microsoft Windows supports extended precision, but requires the /Qlong‑double switch for long double to correspond to the hardware's extended precision format.^[3]

SergeyKostrov · ‎08-03-2018

Your test number is -9223372036854774768 ( this is actually 0x8000000000000410 ) exceeds a maximal number for Double Precision values when no rounding is applied ( 53-bit precision ). 2^53 = 9007199254740992 and it is less than abs( -9223372036854774768 ) = 9223372036854774768. Since your test number is greater than maximal number when no rounding corrections are applied your back conversion is completed with rounding and it doesn't matter how FPU was initialized, doesn't matter if the test was executed on x86 or x64 platforms. Here is a test case that demonstrates when rounding starts "changing" Double Precision values: ... // Sub-Test 6.3 - Limitations of IEEE-754 Standard for DP ( 53-bit ) arithmetic { RTdouble dV0 = ( 9007199254740992.0L + 0.0L ); // 9007199254740992.0 RTdouble dV1 = ( 9007199254740992.0L + 1.0L ); // 9007199254740992.0 !!! ( expected 9007199254740993.0 ) RTdouble dV2 = ( 9007199254740992.0L + 2.0L ); // 9007199254740994.0 RTdouble dV3 = ( 9007199254740992.0L + 3.0L ); // 9007199254740996.0 !!! ( expected 9007199254740995.0 ) RTdouble dV4 = ( 9007199254740992.0L + 4.0L ); // 9007199254740996.0 CrtPrintf( RTU("Limitations of IEEE-754 Standard for DP ( 53-bit ) arithmetic:\n") ); CrtPrintf( RTU("\t( 9007199254740992.0L + 0.0L )=%.1f\n"), dV0 ); CrtPrintf( RTU("\t( 9007199254740992.0L + 1.0L )=%.1f - expected 9007199254740993.0\n"), dV1 ); CrtPrintf( RTU("\t( 9007199254740992.0L + 2.0L )=%.1f\n"), dV2 ); CrtPrintf( RTU("\t( 9007199254740992.0L + 3.0L )=%.1f - expected 9007199254740995.0\n"), dV3 ); CrtPrintf( RTU("\t( 9007199254740992.0L + 4.0L )=%.1f\n"), dV4 ); } ...

SergeyKostrov · ‎08-03-2018

>>On Intel X86, there are two possible ways to do floating-point calculations: >> >>The old way, using the old floating-point co-processor engine that uses extended 80-bit internal representation , >> >>Or using the new SSE floating-point hardware, using either 64-bit representations or 32-bit representations. ' double' enforces 64-bit representation. See the answer in my previous post. 1. 32-bit representation of a Single Precision FP-value has 24-bit precision. A maximal SP-value that won't be rounded is 16777216 ( 2^24 ). Period. 2. 64-bit representation of a Double Precision FP-value has 53-bit precision. A maximal DP-value that won't be rounded is 9007199254740992 ( 2^53 ). Period. There is No need for an MS, or PHD, thesis on that very simple and fundamental thing for Floating Point arithmetic described in IEEE-754 Standard many years ago.

SergeyKostrov · ‎08-03-2018

>>...Your cast may involve a library call, possibly the same one for CL and ICL... There are No any calls during such conversions. Once again, the problem described is actually Not a problem and this is Absolutely correct conversion since FP arithmetic according to IEEE-754 Standard has limitations. See previous post for explanations. Simply run a test I've submitted in Post #5 and you will see how such rounding starts changing values of DP or SP FP variables! ( 9007199254740992.0L + 0.0L ) = 9007199254740992.0 ( 9007199254740992.0L + 1.0L ) = 9007199254740992.0 // ( expected 9007199254740993.0 ) ( 9007199254740992.0L + 2.0L ) = 9007199254740994.0 ( 9007199254740992.0L + 3.0L ) = 9007199254740996.0 // ( expected 9007199254740995.0 ) ( 9007199254740992.0L + 4.0L ) = 9007199254740996.0

GTHER · ‎08-04-2018

Hello Sergey, thank you for your answers.

The problem comes from an incorrect code generated by MS Visual Studio to convert 64 bits unsigned integer to 64 bits ieee754 double precision floating point number, on a windows 64 bits target. The code generated by MSVC is different, and performs correctly, on the windows 32 bits target.

It comes from the fact that the CVTSI2SD instruction converts SIGNED 64 bits integer (and not unsigned) to ieee754 64 bits double precision floating point number, but here we want to convert an UNSIGNED 64 bits integer. As we can not use the VCVTUSI2SD instruction, that would perform the correct conversion directly (because it is an AVX512 instruction and we want to avoid AVX512 instructions for the moment) then a workaround needs to be found to use CVTSI2SD to perform the UNSIGNED instead of SIGNED conversion.

The code generated by MS visual studio does it incorrectly, because it first converts a number that may be negative, then tries to correct the situation in the negative case to get a positive result from the negative result, by adding 2^64 as a double precision floating point number (it actually adds a double precision floating point number with 43F as its exponent and 0 as its significand -

(addsd xmm0, mmword ptr [__real@43f0000000000000]) see post above (quote #3) with the MSVC generated code).

But because the negative and positive situations are inverted, then the rounding results are also inverted, thus the incorrect, inverted rounding result.

Instead, the following code should do the trick correctly, because it starts by right shifting the number in the negative case, in order to remain positive, then adds one to the exponent in the end because the number to convert has been divided by 2:

_TEXT SEGMENT
convert_ui64_to_f64 PROC

; in: rcx = unsigned 64 bits integer (unsigned __int64)
; out: xmm0 = ieee754 64 bits floating point number (long double)

test rcx, rcx

jns __bit_63_not_set

__bit_63_set:

shr rcx, 1

setc al

or cl, al

cvtsi2sd xmm0, rcx

movd  rax, xmm0
mov   rcx, 0010000000000000H
add   rax, rcx
movd  xmm0, rax

ret 0

__bit_63_not_set:

cvtsi2sd xmm0, rcx

ret 0

convert_ui64_to_f64 ENDP
_TEXT ENDS