I am getting inconsistent floating-point results when I run an application on different architectures. The differences appear after the third decimal place. I can get repeatable results on the same architecture, but when I run on a different architecture I see inconsistencies in the floating-point results. The two architectures in question are Nehalem and Sandy Bridge. The application is a 64-bit application, built with Visual Studio 2013 and Intel Composer XE 2013 SP1 C++. I tried the solution recommended in the link below, but it didn't change anything. In the industry I work in, data accuracy up to the 16th decimal place is extremely important. Any help is appreciated.
To get consistent floating-point results, you should set "Precise (/fp:precise)" on the "Floating Point Model" line of the Code Generation tab in project properties. If your application calls math functions, you should also add "/Qimf-arch-consistency:true" to the compiler command line in project properties. This should give you consistent, reproducible results for serial code between different microarchitectures such as those codenamed Nehalem and Sandy Bridge, even if you are targeting different instruction sets. It doesn't necessarily give identical results between different architectures such as IA-32, Intel 64, or Intel MIC architecture. For the latest microarchitecture, codenamed Haswell, you would also need to add /Qfma-. All this should be discussed in the article that you quote.
Parallel code is another matter. Threaded reductions, for example, can give rise to variations even on the same microarchitecture, unless you take precautions.
If the above doesn't work, you should check your command line for any options that might be overriding /fp:precise (or cut and paste it here). /Qftz is one example; switches that affect the floating-point environment such as /Qpc or /Qrcd are others. Otherwise, if you can send us an example of some serial code that gives different results on Nehalem and Sandy Bridge, we'll be happy to investigate.
We understand that in some circumstances, exact reproducibility of results is important. Reproducibility is not the same as accuracy, though. In many cases, the tiny variations in floating-point results, due to microarchitecture, different optimization levels, etc., are much smaller than the uncertainties due to the finite accuracy of floating-point arithmetic. If you are getting variations in the 3rd or 4th significant figure, either you have some very large cancellations in your computation, or something else is going on beyond natural floating-point variations due to differences in rounding or implementation of math functions, in which case, we'd need to investigate differently.
The command-line options are as below:
/Zi /nologo /W2 /O2 /Oi /D "WIN64" /D "NDEBUG" /D "_WINDOWS" /D "_VC80_UPGRADE=0x0600" /D "_WINDLL" /EHsc /MT /GS /fp:precise /Zc:wchar_t /Zc:forScope /Fp"Release\x64\DiffractEngine.pch" /Fa".\Release/" /Fo"Release\x64\" /Fd"Release\x64\vc80.pdb"
/G7 /O2 /QxW /QaxT /Qprec /Qprec_div /Qip /Qunroll /fp:precise /Qftz /Qimf-arch-consistency:true /fp:source
And the floating point model was set at Precise.
You should remove /Qftz for testing, though I think it's pretty unlikely this will change anything in your case. You don't need to use /Qprec or /Qprec-div when you are using /fp:precise.
The switches /QxW and /QaxT have been deprecated for years, the modern forms are /QxSSE2 and /QaxSSSE3. You might want to rethink whether you are getting any benefit from these. Whilst there should not be any connection to your problem, I would remove them while you are testing and investigating your consistency issue, to simplify things. The option /G7 is also very old and deprecated; it has actually been removed from the next compiler version. It does nothing, so should be removed. The option /Qunroll is default, so that also does not need to be in the extra options. You now have /fp:precise twice, but that won't hurt.
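Taken together, a trimmed-down set of extra options for the consistency experiment might look like the line below (a sketch only; keep whatever other options your build genuinely needs):

```text
/O2 /fp:precise /Qftz- /Qimf-arch-consistency:true
```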
Since you are not using any switches such as /QxAVX or /QaxAVX that would generate Sandy Bridge-specific code, the compiler-generated code should be the same, whether it is run on Nehalem or Sandy Bridge. Differences should normally only come from libraries that contain processor-dependent code. The Intel math library (libm) contains such processor-dependent code, that's why you need the switch /Qimf-arch-consistency:true, to ensure the same path is taken for all processors. If you link to the Intel Math Kernel Library, you may need to call a special API to ensure the same results on all processors. If you link to any other libraries, excepting the regular Intel and Microsoft runtimes, it's conceivable that they could contain some processor-dependent code.
If none of this shows up anything suspicious (and I don't really expect it to), the next step would be to compile at a different optimization level (start at /O1 and remove /Qip). If the difference goes away, that gives us another handle (the difference between /O1 and /O2). If the difference remains, repeat at /Od. This will help narrow down the source of the problem.
I would use something like /arch:SSE4.1 for consistency between Nehalem and newer architectures, although, as Martyn said, your settings should take the same code branch on all CPUs which support at least SSE3.
If you use MKL, you are likely to need "conditional numerical reproducibility" options there to prevent use of AVX (faster, slightly different results) on Sandy Bridge.
What API should I use for this? You wrote: "if you link to the Intel Math Kernel Library, you may need to call a special API to ensure the same results on all processors". I tried your recommendations regarding the compiler and did not see any changes.
@Tim, I came across the link below but can't find where to set the CNR options in VS2013. Is the article referring to an environment variable?
It's in that reference: you set MKL_CBWR=COMPATIBLE either by environment variable or by an MKL function call, so as to restrict MKL to that code path, as well as following the advice on alignment and a consistent number of threads. As far as I can see, what Martyn refers to as a different API consists simply of calling that function entry. The /Qimf-arch-consistency option for SVML calls, by contrast, involves a different set of internal library function names.
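Since the original application is C++, the function-call route looks roughly like this in C/C++ (a sketch using MKL's documented CNR entry points in mkl.h; the COMPATIBLE branch is chosen here as the most conservative option, and this is not compiled against MKL here):

```c
#include <mkl.h>
#include <stdio.h>

int main(void) {
    /* Request the architecture-independent "compatible" code path.
       This must be called before any other MKL function. */
    if (mkl_cbwr_set(MKL_CBWR_COMPATIBLE) != MKL_CBWR_SUCCESS) {
        fprintf(stderr, "MKL CNR mode not supported on this CPU\n");
        return 1;
    }
    /* Query which branch is actually in effect. */
    int branch = mkl_cbwr_get(MKL_CBWR_BRANCH);
    printf("MKL CBWR branch: %d\n", branch);
    return 0;
}
```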
The SSE2 setting for MKL_CBWR apparently allows the use of SSE instructions that are not IEEE-accurate, where results on non-Intel CPUs are expected to differ.
You have the option of setting SSE4_1 for reproducible results among all Intel CPUs supporting those instructions (dropping back to SSE2, with possibly slightly different results, on other CPUs), or AVX to avoid using AVX2 instructions such as FMA.
There's no explicit IDE option for conditional numerical reproducibility in MKL. You can either make the calls from your source code, e.g.

use mkl_service   ! (compile mkl_service.f90 if necessary, or include mkl.fi)
istatus = mkl_cbwr_set(MKL_CBWR_SSE4_2)
or set a run-time environment variable (this you can do in the MSVS IDE, from Project Properties > Configuration Properties > Debugging > Environment):

MKL_CBWR=SSE4_2

The SSE4_2 setting assumes that you will only run on Nehalem (Intel Core processors) or later Intel processors.
There's a bit more detail in the web article that you quote, and much more in the MKL user guide (under "Obtaining Numerically Reproducible Results") and the MKL reference manual, which is accessible through MSVS. You can verify the settings by the call
istatus = mkl_cbwr_get(MKL_CBWR_BRANCH)
Note the other requirements, such as a fixed number of threads. Tim's suggestion of data alignment with -align array32byte is also good. The "assume" settings he mentions are already implied by /fp:precise.
Sergey Kostrov wrote:
>>...I am getting inconsistent floating point results when i run an application on different architecture...
That issue was discussed many times on IDZ, and my best advice is to use the '_controlfp' CRT function to initialize the FPU at the beginning of the calculations. I've been using that approach for many years in scientific applications, and results are reproducible even on the MS-DOS operating system, between Intel 486, Pentium II, Pentium 4, and Ivy Bridge CPUs, between different C++ compilers, etc.
Windows x64 sets _controlfp to a standard setting before turning control over to a .exe. The Intel compiler's /Qftz[-] setting mentioned above puts code in main() to override the corresponding bit in MXCSR. There's nothing there which would account for a difference between Nehalem and Sandy Bridge on the same OS, or reconcile differences due to changing the instruction set.
As the discussion developed above, it became evident that the MKL library was in use, with different vector widths and numbers of threads on the two platforms.