Targeting 486 instead of 686 to troubleshoot small FP variations

rokelley · ‎07-10-2023

We have an app originally built with IPSXE 2018 that we're now building with oneAPI 2022.0, but we're getting slightly different floating point results around the 5th significant figure. Both builds are targeting Windows IA32 x86/x87. When I compared the assembly listings from both builds, I noticed the IPSXE version was using the .486p & .387 directives and the oneAPI version was using the .686p & .387 directives. Is there a way to limit the compiler to .486p? Are there some options I should try? So far I've tried disabling optimization, rounding/not rounding floating point results (/Qfp-port), disabling speculation of floating point operations, using the safe floating point model, and enabling Minus Zero support.

mecej4 · ‎07-11-2023

Intel OneAPI and Parallel Studio compilers do not produce assembly files unless requested, so any compiler directives (and everything else, for that matter) that you see in the output assembly files are for information purposes. It would be better for you to record the two sets of compiler options that you used with the old and new compilers.

Older versions of the compiler accepted the /arch:IA32 option, and generated x87 FPU instructions. The current OneAPI (correction: ifx) compiler does not generate x87 instructions, and rejects the /arch:IA32 option. Since the x87 and SSE floating point registers and instructions are quite different, it is not surprising that the x87 and SSE versions of your program gave somewhat different results.

You may confirm my speculation by producing disassembly listings (dumpbin /disasm xyz.obj) from the new and old OBJ files and comparing them.
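A rough way to compare two disassembly listings is to count x87 versus scalar SSE mnemonics in each. The sketch below is only an illustration: the function name `classify`, the mnemonic sets (deliberately incomplete), and the two sample listings are all assumptions written by hand in the general shape of disassembler output, not output from either of the builds discussed here.

```python
from collections import Counter

# Partial, illustrative mnemonic sets (assumption: not exhaustive).
X87 = {"fld", "fst", "fstp", "fild", "fistp", "fadd", "faddp", "fsub",
       "fsubp", "fmul", "fmulp", "fdiv", "fdivp", "fxch", "fchs",
       "fabs", "fsqrt", "fldz", "fld1"}
SSE = {"addss", "addsd", "subss", "subsd", "mulss", "mulsd",
       "divss", "divsd", "sqrtss", "sqrtsd", "cvtss2sd", "cvtsd2ss"}

def classify(listing: str) -> Counter:
    """Count lines whose first recognized token is an x87 or SSE mnemonic."""
    counts = Counter()
    for line in listing.splitlines():
        for token in line.lower().split():
            if token in X87:
                counts["x87"] += 1
                break
            if token in SSE:
                counts["sse"] += 1
                break
    return counts

# Hand-written sample lines (hypothetical, for illustration only).
x87_sample = """
  00000000: D9 45 08           fld         dword ptr [ebp+8]
  00000003: D8 4D 0C           fmul        dword ptr [ebp+0Ch]
  00000006: D9 5D FC           fstp        dword ptr [ebp-4]
"""
sse_sample = """
  00000000: F3 0F 59 45 0C     mulss       xmm0,dword ptr [ebp+0Ch]
"""

print(dict(classify(x87_sample)))
print(dict(classify(sse_sample)))
```

If both listings show only x87 counts, the difference in results has to come from something other than an x87-to-SSE switch.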

rokelley · ‎07-11-2023

I think ifort still supports /arch:IA32. I disassembled both bins and I can see the x87 instructions and ST registers in both. The same compiler options were used in both builds AFAIK but I'll scrape the actual options used out of the build log here in a second.

rokelley · ‎07-11-2023

The actual ifort options used in oneAPI 2023.1:

/nologo
/O2
/arch:IA32 
/Qdiag-error-limit:50 
/warn:all 
/Qsave 
/Qinit:zero
/libs:static 
/threads 
/c 
/Qm32

The same for IPSXE2018:

/nologo
/O2
/arch:IA32 
/Qdiag-error-limit:50 
/warn:all 
/Qsave 
/libs:static 
/threads 
/c
/Qm32
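
The two option lists above can be compared mechanically. This is a minimal sketch that copies the options exactly as posted and takes the set difference in each direction; the variable names are made up for the example. Whether the difference it finds has any effect on the floating point results is not established in this thread.

```python
# Options as posted for the oneAPI ifort build.
oneapi_opts = ["/nologo", "/O2", "/arch:IA32", "/Qdiag-error-limit:50",
               "/warn:all", "/Qsave", "/Qinit:zero", "/libs:static",
               "/threads", "/c", "/Qm32"]

# Options as posted for the IPSXE 2018 build.
ipsxe_opts = ["/nologo", "/O2", "/arch:IA32", "/Qdiag-error-limit:50",
              "/warn:all", "/Qsave", "/libs:static", "/threads",
              "/c", "/Qm32"]

only_oneapi = sorted(set(oneapi_opts) - set(ipsxe_opts))
only_ipsxe = sorted(set(ipsxe_opts) - set(oneapi_opts))

print("only in oneAPI build:", only_oneapi)
print("only in IPSXE build: ", only_ipsxe)
```

For the lists as posted, the only option that differs is /Qinit:zero, which appears in the oneAPI list but not in the IPSXE list.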

mecej4 · ‎07-11-2023

I took a short (9 lines, single assignment statement in a DO loop) subroutine, and compiled it with the options that you reported (I am using the current OneAPI compiler and the Parallel Studio 2013SP1 compilers). Comparing the assembler files and the disassembled OBJ files did not show me any differences that could cause the results to differ. I did note the .486P versus the .686P, but that should not matter.

Will you be able to supply a short "reproducer", i.e., a minimal working example code, that can be used to establish that the two compiler versions lead to different floating point results after running?

jimdempseyatthecove · ‎07-11-2023

>> I did note the .486P versus the .686P, but that should not matter.

It will if SSE/AVX code is generated, because the x87 FPU internally uses 80-bit floating point. When temporary intermediate results in an expression are held on the FPU stack, they are stored as 80-bit values (as opposed to 64- or 32-bit values when held in an SSE/AVX register). This can result in loss of precision in the SSE/AVX case.
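
The effect of wider intermediates can be imitated at a smaller scale. The sketch below does not use the x87 at all: as an assumption for illustration, it treats IEEE single precision as the "narrow" format and Python's double precision floats as the "wide" one, and compares rounding after every operation with rounding once at the end. The helper `to_f32` is a name invented for this example.

```python
import struct

def to_f32(x: float) -> float:
    """Round a Python float (IEEE double) to the nearest IEEE single."""
    return struct.unpack("f", struct.pack("f", x))[0]

one = 1.0
tiny = 2.0 ** -24   # half the spacing between 1.0 and the next float32

# Narrow intermediates: round to float32 after every addition.
narrow = to_f32(to_f32(one + tiny) + tiny)

# Wide intermediates: add in double, round to float32 once at the end.
wide = to_f32(one + tiny + tiny)

print(narrow, wide)
```

With narrow intermediates each addition rounds back to 1.0, so the small terms are lost; with wide intermediates they accumulate to one full float32 step before the final rounding.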

Also, optimization may change the sequence of operations, which can alter how the result is rounded.
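
A small illustration of why the order matters: floating point addition is not associative, so regrouping the same three terms can change the answer. The values here are chosen for the example and are not taken from the application in question.

```python
a, b, c = 1.0e16, -1.0e16, 1.0

# Same three terms, two groupings.
left_to_right = (a + b) + c   # a and b cancel first, then c is added
regrouped = a + (b + c)       # c is absorbed into b by rounding, then cancelled

print(left_to_right, regrouped)
```

In double precision the first grouping gives 1.0 and the second gives 0.0, because 1.0 is below the spacing between adjacent doubles near 1e16.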

CAUTION: it has been rumored that future CPUs might eliminate the FPU instructions. You are advised to migrate to AVX, which may require you to produce or find a different set of certified results data.

Jim Dempsey

rokelley · ‎07-11-2023

We're using

/arch:IA32

So AFAIK MMX/SSE/AVX instructions are not in use. I'll try to come up with some sanitized example code that reproduces the behavior we're seeing.

Steve_Lionel · ‎07-13-2023

I suggest you read "Improving Numerical Reproducibility in C/C++/Fortran". There are many factors that can change last-bit results, some of which you cannot control.

Ron_Green · ‎07-13-2023

Steve's reference gives you the core options for FP control.

I have attached another reference for additional details.

You can also use

-dryrun

or

-#

with both compilers. This will cause the driver to print all defines, libs, paths, and options passed from the driver to the backend compiler.

You could also try IFX instead of IFORT. In the interplay of optimization versus accuracy, IFX leans more towards accuracy.