Solved: Compiler options yield strange precision issue

prop_design · ‎07-27-2022

Hi,

I noticed this issue a few months ago. It's very strange. I was getting slightly different results between codes that use the same math. I eventually found the compiler options were the cause. The following compiler options were causing the issue:

ifort PROP_DESIGN_XYZ.f /O3 /arch:core-avx2 /QaxCOMMON-AVX512 /Qopt-zmm-usage:high /Qm64 /Qprec-div /Qprec-sqrt /Qsimd-honor-fp-model /static /warn:unused /Qsave-temps-

ifort PROP_DESIGN_ANALYSIS.f /O3 /arch:core-avx2 /QaxCOMMON-AVX512 /Qopt-zmm-usage:high /Qm64 /Qprec-div /Qprec-sqrt /Qsimd-honor-fp-model /static /warn:unused /Qsave-temps-

ifort PROP_DESIGN_OPT.f /O3 /arch:core-avx2 /QaxCOMMON-AVX512 /Qopt-zmm-usage:high /Qm64 /Qprec-div /Qprec-sqrt /Qsimd-honor-fp-model /static /warn:unused /Qsave-temps-

The fix was to change the options to:

ifort PROP_DESIGN_XYZ.f /O3 /arch:generic /QaxSSE2 /Qopt-zmm-usage:high /Qm64 /Qprec-div /Qprec-sqrt /Qsimd-honor-fp-model /static /warn:unused /Qsave-temps-

ifort PROP_DESIGN_ANALYSIS.f /O3 /arch:generic /QaxSSE2 /Qopt-zmm-usage:high /Qm64 /Qprec-div /Qprec-sqrt /Qsimd-honor-fp-model /static /warn:unused /Qsave-temps-

ifort PROP_DESIGN_OPT.f /O3 /arch:generic /QaxSSE2 /Qopt-zmm-usage:high /Qm64 /Qprec-div /Qprec-sqrt /Qsimd-honor-fp-model /static /warn:unused /Qsave-temps-

I have an 11th-gen Tiger Lake laptop processor and am using Windows 10. The details are:

Processor 11th Gen Intel(R) Core(TM) i5-1135G7 @ 2.40GHz, 2419 Mhz, 4 Core(s), 8 Logical Processor(s)

Windows 10 Home Version 10.0.19044 Build 19044

It seems like a problem with the use of FPU registers. The SSE2 registers seem to work as expected while the AVX2 do not. I don't know if this behavior is dependent on the processor or the compiler.

I created a program called PROP_DESIGN about 14 years ago. I worked on it since that time. Only recently have I noticed this problem. In the past, I have had AMD and Intel processors and used two other compilers besides Intel Fortran. Intel Fortran has always worked the best, so I prefer to use it. This issue is concerning though.

I see the problem when looking at the results of the various codes; XYZ, OPT, ANALYSIS. They come up with slightly different answers when using the /arch:core-avx2 /QaxCOMMON-AVX512 options. I have tried many other options, but this seems to show the issue the best. All the codes are freely available on my website, if you want to test them yourself.

I can provide more information if necessary.

Thanks,

Anthony

jimdempseyatthecove · ‎07-27-2022

What are you comparing your new compilation with? An old build? A different system?

Note: if the old build (master "gold" results) was built using SSE2, and the new build runs using AVX512, then any loops performing a reduction are subject to round-off differences whenever the dynamic range between the smallest values and (the largest values .OR. the current running sum) exceeds the precision of the floating-point format, and/or the values are not exactly representable.

There is a similar issue (differing vector widths producing different round-off in reductions) when you use a different number of threads in a reduction.

And FMA versus non-FMA code can produce different approximate results.

Jim Dempsey

prop_design · ‎07-27-2022

hi jim,

i'm using the latest intel compiler. i noticed the problem when working with PROP_DESIGN. There are different codes that contain the same math. This issue was in the geometry creation math. XYZ, ANALYSIS, and OPT (as well as all the other codes in the package) use the same math in many instances. So they should come back with the exact same answer. However, that's not the case when using certain compiler options. SSE2 works as expected. The AVX2 does not. I did try FMA during one test, but that didn't seem to be of much use here. It didn't seem to be related to the problem either. The best I can tell is something is going on with the precision when these options are used:

/arch:core-avx2 /QaxCOMMON-AVX512

I also tried:

/arch:tigerlake /QaxTIGERLAKE

Which is for my specific processor. I briefly tested ifx /arch:core-avx2; however, I don't recall if the issue was present with that compiler or not. I'll have to set up the test case again. When I first ran into the problem, I didn't know how to report it to anyone. I just now found I could register for the Fortran forum and thought I should mention it.

Thanks,

Anthony

prop_design · ‎07-27-2022

hmm,

i tried to duplicate what i was seeing in the past and i can't. if i run into it again, i will provide more info. sorry about that. i read my old notes on the issue and it was between XYZ and ANALYSIS where I noticed the problem. the other codes also use the same math. however, with regards to this, there is often no output where you could check for the issue.

anthony

Ron_Green · ‎07-27-2022

As Jim mentioned, it could possibly be the fused multiply-add (FMA) instructions added in AVX2. You can look up and try the compiler option -no-fma. You can also try -no-vec, which is a big hammer that disables all vectorization. And of course -fp-model precise is a go-to first stab at numerical precision. These are the Linux options; there are / versions for Windows (/Qfma-, /Qvec-, and /fp:precise).

Interestingly, if it is FMA, note that FMA is MORE precise than a separate multiply and add, because the result is rounded once instead of twice.
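
A minimal sketch of why a single rounding can change the answer. This is not taken from PROP_DESIGN: the values are chosen purely for illustration, the FMA is emulated in quad precision (assuming the compiler provides a 33-digit real kind), and the VOLATILE attribute is used only to force the intermediate product to be rounded to double before the add.

```fortran
program fma_rounding_demo
  use, intrinsic :: iso_fortran_env, only: dp => real64
  implicit none
  ! Assumption: the compiler offers a quad-precision kind (ifort and gfortran do)
  integer, parameter :: qp = selected_real_kind(33)
  real(dp) :: a, b, c, separate, fused
  ! VOLATILE forces the product to be stored (and rounded to double) before the add,
  ! so the compiler cannot contract the two operations into one FMA
  real(dp), volatile :: product_rounded

  a = 1.0_dp + scale(1.0_dp, -30)   ! 1 + 2**(-30), exactly representable
  b = 1.0_dp - scale(1.0_dp, -30)   ! 1 - 2**(-30), exactly representable
  c = -1.0_dp

  ! Separate multiply and add: a*b = 1 - 2**(-60), which rounds to 1.0 in double
  product_rounded = a * b
  separate = product_rounded + c

  ! Emulated FMA: product and sum are exact in quad for these inputs,
  ! so the only rounding is the final conversion back to double
  fused = real(real(a, qp) * real(b, qp) + real(c, qp), dp)

  print '(a,es12.4)', 'separate multiply and add:   ', separate
  print '(a,es12.4)', 'single rounding (FMA-like):  ', fused
end program fma_rounding_demo
```

With IEEE double precision, the separate version should give exactly zero, while the single-rounding version keeps the small term -2**(-60), about -8.67E-19. Both are "correct" to double precision in a relative sense, but they are not bit-identical, which is the kind of difference that shows up when one build uses FMA and another does not.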

prop_design · ‎07-27-2022

hi ron,

i wasn't aware of the /fp option. i think the /fp:consistent option would have probably fixed what i was seeing. i added that to my test cases. i'm going to accept your reply as the solution, since it seems like that would have worked. i had previously tried:

/Qprec-div /Qprec-sqrt /Qsimd-honor-fp-model

but those didn't help the problem. i think you are saying that when you use /arch:core-avx2 it automatically adds /Qfma. i didn't know that. i had been manually adding /Qfma thinking that turned it on and it was otherwise off. so that could explain why i didn't see much of a difference by turning it on.

anthony

Steve_Lionel · ‎07-27-2022

An oldie but still a goodie - Improving Numerical Reproducibility in C/C++/Fortran (supercomputing.org)

prop_design · ‎07-27-2022

thanks guys,

when i tried to reproduce it today, for this forum post, i was very surprised to see that i couldn't. i spent several hours trying to get it to occur again. i don't know what i did to make it go away permanently. there were some code changes. however, i wouldn't have expected them to have anything to do with this. unfortunately, i don't keep old versions of the code. otherwise, i'd have way too much data to store.

i saved your presentation. thanks for that. also thanks to ron for the tips. as far as i know, fma didn't really have any effect on my codes. so i haven't been using it. i have a number of compiler option tests that i use to figure out what works best. one of them is testing fma.

the only thing i can figure, right now, is the root problem must have been something in my code. then the compiler option change fixed it. at least about a month ago. now, with the current code, all the compiler options work as expected. so i'm completely puzzled and disappointed that i can't duplicate it.

i tried to look through the codes and see if i might have changed anything that would matter. i didn't see anything. at the time i had the problem, the codes appeared to be the same as well. which is what got me into trying different compiler options.

i should also say congrats on making a fantastic fortran compiler. i have been testing it for the last 14 years and it's always 2-3x faster than gfortran. it was also faster than the old portland group compiler. by faster, i mean my executable files run 2-3x faster. i noticed your beta ifx compiles really fast. the run times seem good as well. however, due to the huge difference in supported compiler options, i have stuck with ifort. but it looks like your future compiler will be good as well.

Ron_Green · ‎07-28-2022

@prop_design glad the code is giving expected results. Sometimes run-to-run data alignment can cause this. You might try /align:array64byte to get consistent data alignment of arrays from run to run.

For IFX, I'm curious if there are any ifort options you would like to see in ifx. We are working on the -ax option: this is proving to be remarkably difficult in LLVM but it is in the works. I can help prioritize missing options, or explain why we can't do those in ifx. Or perhaps suggest a comparable option.

thanks again for the feedback.

prop_design · ‎07-28-2022

Hi Ron,

Thanks for the tip. You guys really know your product. I kind of stumble across things from time to time. Then I test them to see if they help or not. I've read through the manual many times. However, I've clearly missed a lot. Over the years, I have found some nice/helpful features that I didn't know about. For instance, I just found an option to automatically delete the *.obj files. Which is very helpful.

From what i can tell, so far, my codes produce repeatable results. So I'm not sure the issue you brought up would apply. I wish I knew what was going on. But, it seems to be completely gone right now. I'm disappointed I wasn't a part of the forum a month ago. I would have had a great test case. When I created this post, I assumed the problem was still present, since the only thing I did to fix it was change the compiler options.

As for your question, I don't know that there are any additional options I would need. I think it was just confusion on my part. Since you have so many options for ifort and not for ifx, I assumed you had a long way to go to finish ifx. As far as speed goes, ifx is very fast in both compiling and running. So in those regards, I could probably use it now. The only real options I need are /static, so I can distribute my code, and some way to compile for as many x86-64 processors as possible. At one point, I was so confused about how to make sure the Intel compiler would make code that ran on both AMD and Intel that I gave up and switched compilers. Then that compiler got bought up and cancelled by NVIDIA. gfortran always creates exe files that run very slowly, so I didn't want to use that again. Also, many of its features only work on Linux and the user manual doesn't say that. So I kept reading your user manual and think I may have finally figured out how to get an exe file that will run on the most processors possible. Usually, I only have an AMD or an Intel computer at any given time, so I usually can't test to make sure the exe is working as intended. But that would be the only other requirement for me personally. All the tuning options and the fine print for them are hard to understand for an outsider. Your user manual is really good, but it's still hard to follow sometimes, mainly when it comes to whether a feature works on AMD or just Intel processors.

These are some of the setups I have been testing. This one I use for the code I distribute:

/O3 /arch:generic /QaxSSE2 /Qopt-zmm-usage:high /Qm64 /Qprec-div /Qprec-sqrt /Qsimd-honor-fp-model /static /warn:unused /Qsave-temps-

This provides a little more speed but had the issue I was trying to report in this forum topic:

/O3 /arch:core-avx2 /QaxCOMMON-AVX512 /Qopt-zmm-usage:high /Qm64 /Qprec-div /Qprec-sqrt /Qsimd-honor-fp-model /static /warn:unused /Qsave-temps-

This is the ifx setup I have been testing:

/O3 /arch:core-avx2 /Qm64 /static /warn:unused /Qsave-temps-

The /warn:unused was another feature I only recently found. This was a 'huge' thing for me, that really helped a lot. I had been going through an elaborate process to find unused variables. So I didn't always check for them and they would build up over time. Now, it's completely automatic and checks every compile. So I was very happy to find that one. I don't recall if I tested ifx with the sse2 extensions or if they are even present. I haven't looked into ifx that much to be honest. I've always been very happy with ifort and kind of sad that it was being replaced. It looks like the options to 'ensure floating point math is done correctly' are missing from my ifx test case. So, perhaps, they are not available yet. If that's the case, then those would be useful. I always try to get all the precision that I can out of my codes. I even go through the trouble of testing quad precision versions to debug precision issues. I would rather have precision over speed, within reason. Since ifort already provides the speed, I can afford to back off a little to ensure precision.

jimdempseyatthecove · ‎07-28-2022

>>There are different codes that contain the same math. This issue was in the geometry creation math. XYZ, ANALYSIS, and OPT (as well as all the other codes in the package) use the same math in many instances. So they should come back with the exact same answer.

The same source code, compiled into code paths for SSE2 and for (AVX .OR. AVX2 .OR. AVX512), will produce different instructions, and different sequences of instructions, which can produce different results.

Assume:

a) Single hardware thread (i.e. do not consider multi-thread issues for example)

b) you have a loop performing a summation of an array of REAL(8)

c) you compare results between SSE2 (2-wide REAL(8)) and AVX512 (8-wide REAL(8))

SUM = 0.0_8
DO I=1, 10000
  SUM = SUM + ARRAY(I)
END DO
PRINT *,SUM

SSE2 has two DP lanes: 0 and 1

Lane 0 will sum ARRAY(1) + ARRAY(3) + ARRAY(5) + ... + ARRAY(9999)

Lane 1 will sum ARRAY(2) + ARRAY(4) + ARRAY(6) + ... + ARRAY(10000)

At the end of the loop, the two lanes are summed into a single result.

Note that when the array values are not exact, each lane accumulates its own rounding errors, and the final sum is the sum of two values that already carry rounding errors. Additionally, summing numbers with differing binary exponents can lose more bits in the partial sums. IOW the sum using SSE2 may (can) differ from the sum using scalar (one at a time) summation.

Whereas AVX512 has eight DP lanes: 0:7

Lane 0 will sum ARRAY(1) + ARRAY(9) + ARRAY(17) + ... + ARRAY(9993)

Lane 1 will sum ARRAY(2) + ARRAY(10) + ARRAY(18) + ... + ARRAY(9994)

...

Lane 6 will sum ARRAY(7) + ARRAY(15) + ARRAY(23) + ... + ARRAY(9999)

Lane 7 will sum ARRAY(8) + ARRAY(16) + ARRAY(24) + ... + ARRAY(10000)

Now, instead of two partial sums, you have eight partial sums, each potentially introducing different rounding errors, which are then summed together.

Scalar: one running sum, with a (potential) rounding error at each cell of the array, in sequence

SSE2: two partial sums, offset by 1 per lane, each with a (potential) rounding error at every 2nd cell

AVX(1/2): four partial sums, offset by 1 per lane, each with a (potential) rounding error at every 4th cell

AVX512: eight partial sums, offset by 1 per lane, each with a (potential) rounding error at every 8th cell

Depending on the values in the ARRAY, you may get the same result regardless of instruction set used (e.g. summing DP floating point whole numbers). More often, the DP values are approximate values (with some error) .AND. differ in magnitude (thus introducing additional loss of precision when summed).
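
The lane-by-lane summation described above can be imitated in plain Fortran. The following is only a sketch: the array contents (1/i) are made up for illustration, `lane_sum` is a hypothetical helper, and the final lane-combining step is done left to right, whereas real hardware and compilers may combine the lanes in a different order.

```fortran
program lane_sum_demo
  use, intrinsic :: iso_fortran_env, only: dp => real64
  implicit none
  integer, parameter :: n = 10000
  integer, parameter :: widths(4) = [1, 2, 4, 8]   ! scalar, SSE2, AVX/AVX2, AVX512
  real(dp) :: array(n)
  integer :: i

  ! Illustrative data: inexact values that differ in magnitude
  do i = 1, n
    array(i) = 1.0_dp / real(i, dp)
  end do

  do i = 1, size(widths)
    print '(a,i1,a,es24.16)', 'lanes = ', widths(i), '  sum = ', &
          lane_sum(array, widths(i))
  end do

contains

  ! Hypothetical helper: sum a(:) using "lanes" interleaved partial sums,
  ! then add the partial sums together at the end
  function lane_sum(a, lanes) result(total)
    real(dp), intent(in) :: a(:)
    integer, intent(in) :: lanes
    real(dp) :: total
    real(dp) :: partial(lanes)
    integer :: i, k

    partial = 0.0_dp
    do i = 1, size(a)
      k = mod(i - 1, lanes) + 1          ! lane that element i falls into
      partial(k) = partial(k) + a(i)
    end do

    total = 0.0_dp
    do k = 1, lanes
      total = total + partial(k)
    end do
  end function lane_sum

end program lane_sum_demo
```

For data like this, the four printed sums would be expected to agree to many digits and may differ only in the last digit or two. To keep the compiler from reordering these loops itself, it is probably best to build the sketch with optimization off or with /fp:precise.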

Therefore, when testing code changes against a "standard" results data set, you must use the same instruction set to assure the correctness of your source code.

Then, for performance reasons, you can introduce the newer instruction set and review (compare) the results data with the "standard" results to confirm that the differences (if any) are explained by a difference in the sequence of reduction. IOW the two differing results are consistent within the margin of error in the sequence of operation at the machine level.

Jim Dempsey

prop_design · ‎07-28-2022

wow, thanks Jim.

the mastery of Fortran and computers, of the people here, is amazing.

yeah, i have often looked into the precision of PROP_DESIGN, over the last 14 years. it is a really odd thing to get a handle on. once i start seeing a precision issue, it helps to run it as quad precision. that's about the easiest way to make sure it's a compute problem rather than a code problem. doing those comparisons and also making some simpler test codes has always shown me how hard it is to quantify what is going on. it's interesting to learn that compiler options matter too. that's not something i took into account when i started this project.

one recent test i did showed the following:

THEORETICAL SINGLE PRECISION LIMIT; 1.0E-7

THEORETICAL DOUBLE PRECISION LIMIT; 1.0E-15

THEORETICAL QUAD PRECISION LIMIT; 1.0E-33

DOUBLE PRECISION OPTIMIZATION ITERATION TOLERANCE LIMIT; 1.0E-9 (NOTE; 1.0E-11 DOES NOT ALWAYS SOLVE, DUE TO COMPUTING ERROR)

QUAD PRECISION OPTIMIZATION ITERATION TOLERANCE LIMIT; 1.0E-28 (NOTE; 1.0E-29 DOES NOT ALWAYS SOLVE, DUE TO COMPUTING ERROR)

RATIO1 = 15/7 = 2.1

RATIO2 = 33/15 = 2.2

RATIO3 = 28/9 = 3.1 (YOU MIGHT EXPECT 2.1 OR RATIO2 = 18.9/9 )

So it's a hard topic to get a hold of. I usually just have to live with whatever it is doing. There are some other oddities I have found that are often hard to find simple explanations for. One is, if an input ends in 0.1, 0.3, etc (odd numbers) then it will not converge to the same number of significant digits as when it ends in an even number (0.2, 0.4, etc). Usually, the explanations are way more complicated than just saying the above.

Then there are the supposed limits of significant digits (1e-15, 1e-33). To me, there is no way you can say that. It would seem to have to do with the number of computations, at a minimum. Then you would still not know, because you have no idea which numbers are ending in odd or even digits. So ultimately, it is a dead end. However, it looks like when I compute in double precision I can at least reliably know that I'm good to what single precision is supposed to be. Then, oddly, quad precision goes more than double that. So you would think if you compute in quad precision, by the same logic, you would be good to double precision. However, I am seeing I am good to beyond double precision. Of course, since hardware doesn't work in quad precision, I only do that for testing.

I think it would be a lot better if they invented a computer that doesn't have the issue of not being able to represent all numbers. Then there would be no such error at all. It's scary to have an error that you can't really quantify, and then they throw out these rules as if they are true, when they are not. Like you get 1e-15 significant digits for double precision. Sure, but only for one number comparison, which is useless. In a real code, you have no idea of what the precision is. That's the scary part. I kind of have to laugh at the new craze of computing in single precision on graphics cards and acting like this is some great new invention. Then they use it with CFD, which has never been any good to begin with. It's rather comical but sad at the same time.

Oh, on the main topic of this thread. I asked a user of my code if he had a version from a month ago and he does. So I'm hoping I might be able to produce an example of what I had seen. Still too early to know for sure.

jimdempseyatthecove · ‎07-28-2022

The precisions you listed are for numbers between (approximately) 0.5 and 1.5.

IOW, for single precision, DO NOT ASSUME that 1.0E-7 is approximately the smallest number you can add to any single precision number and then see a difference.

Instead use:

SmallestNumberForNumberOfInterest = (NumberOfInterest) * 1.0E-7

However, it is recommended that you use the EPSILON function to produce the proper (most correct) value for the variable type.

SmallestDeltaToFoo = Foo * EPSILON(Foo)

Note, for convergence routines, the SmallestDeltaToFoo (your number of interest), might need to be somewhat larger, say 2x this or so. Otherwise the convergence routine might fail to converge (flipping back and forth forever across the convergence point).
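
A short sketch of the above, using the standard EPSILON, SPACING and NEAREST intrinsics. The value of `foo`, the function `converged`, and the safety factor of 2 are illustrative assumptions, not anything taken from PROP_DESIGN; the `real128` kind is assumed to be supported by the compiler.

```fortran
program epsilon_demo
  use, intrinsic :: iso_fortran_env, only: sp => real32, dp => real64, qp => real128
  implicit none
  real(dp) :: foo

  ! Relative precision of each kind (gap between 1.0 and the next larger number)
  print '(a,es10.2)', 'epsilon, single: ', epsilon(1.0_sp)
  print '(a,es10.2)', 'epsilon, double: ', epsilon(1.0_dp)
  print '(a,es10.2)', 'epsilon, quad:   ', epsilon(1.0_qp)

  ! The smallest meaningful delta scales with the magnitude of the number
  foo = 12345.678_dp
  print '(a,es10.2)', 'abs(foo)*epsilon(foo): ', abs(foo) * epsilon(foo)
  print '(a,es10.2)', 'spacing(foo):          ', spacing(foo)

  ! Two iterates one representable step apart, then two that are clearly apart
  print *, 'adjacent values converged? ', converged(nearest(foo, 1.0_dp), foo, 2.0_dp)
  print *, 'distant values converged?  ', converged(foo * 1.000001_dp, foo, 2.0_dp)

contains

  ! Hypothetical helper: relative convergence test scaled to the iterates,
  ! with a safety factor so the iteration does not flip back and forth forever
  logical function converged(x_new, x_old, safety)
    real(dp), intent(in) :: x_new, x_old, safety
    converged = abs(x_new - x_old) <= &
                safety * epsilon(x_new) * max(abs(x_new), abs(x_old))
  end function converged

end program epsilon_demo
```

The ABS is there because a tolerance computed as Foo * EPSILON(Foo) would be negative for a negative Foo. SPACING(Foo) gives the actual gap to the next representable number at that magnitude, which is at most ABS(Foo) * EPSILON(Foo).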

Jim Dempsey

prop_design · ‎07-29-2022

I looked into this some more and I still can't duplicate what I thought was the problem. I'm pretty sure I made a mistake when doing the original comparison. The changes with compiler options are not between codes, like I originally thought. If you change the compiler options, the results can be slightly different. However, that difference is below my iteration tolerance. So it's not an issue. If you compile both codes with the same options, they yield the same results.