non-repeatability of code results

jespersen · ‎02-08-2008

I have a small test case for a large CFD (Computational Fluid Dynamics) code
which can give slightly different answers if I run it multiple times (the differences
are in the low order bits). This is on quad-core Xeon (X5355) with ifort 10.1.

Typically non-repeatability might be due to something not being initialized, but I
don't think that's the case here. This code gives repeatable results on other platforms.
Also this code gives repeatable results if I compile with "-fp-model precise" or if I compile
with -O1. For one set of tests I always used
-pad -align -auto -ftz -fpe0 -IPF-fma -IPF-fp-relaxed -w -traceback -xT
and I got the following
Add'l Options     Result
-O3 -ip -xT       nonrepeatability
-O3 -ip           nonrepeatability
-O3               nonrepeatability
-O2               nonrepeatability
-O1               repeatability
-O1 -ip           repeatability
-O1 -ip -xT       repeatability
I seemed to get nonrepeatability if and only if "LOOP WAS VECTORIZED" messages appeared during compilation.

When I say nonrepeatability I mean that if I run the code 8 times (say), I
might get 3 different solutions of the 8 total solutions. Any differences are
less than unit roundoff. This happens in both single and double precision.
Each run takes less than 1 second.

This is a Linux system, "uname -snrmpio" returns
Linux service0 2.6.16.53-0.16-smp x86_64 x86_64 x86_64 GNU/Linux

I am actually not too worried about this, but I am really curious as to what
could cause a code not to give the same answer each time you run it. Does
anyone have an idea what the cause could be?

TimP · ‎02-08-2008

You have quoted options which apply only to the IA-64 compiler; apparently that doesn't matter. I'm not sure of the combined effect of some of the unusual options you have used.
It looks as if the differences are associated with alignment of arrays used in vectorized sum reduction. This shouldn't happen in x86-64 (with 64-bit compiler; you didn't specify that), unless you use a non-standard method to allocate memory, since all automatic alignments should be at consistent 16-byte boundaries.
It is common on 32-bit OS, where there is no consistent alignment of dynamic allocations. The vectorization of these operations changes the order of additions, usually for the better, but with dependency on alignment.
As you noticed, -fp-model precise removes the alignment-sensitive vectorizations.
Removing -auto might improve consistency of alignment; if it does so, you could make a case there is a bug in -auto. Similarly with the other options which purport to affect alignment (pad, align).
I don't like -xT, as it seems often to reduce performance (compared with -xW or -xP), but I don't know that it could have the effect you mention.
It is quite likely that the differences you see are much less than the round-off errors in your analysis, but I agree that they could indicate something amiss, if seen when using all standard 64-bit tools.
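
To illustrate the point about reordered additions, here is a minimal sketch in Python (chosen only for illustration; the code in this thread is Fortran). It compares a strictly sequential sum with a lane-wise sum of the kind a vectorized reduction performs. The function names, the two-lane width, and the data are assumptions made up for this example, not anything taken from ifort.

```python
def sequential_sum(values):
    """Add the values one after another, left to right."""
    total = 0.0
    for v in values:
        total += v
    return total


def lane_sum(values, lanes=2):
    """Keep one partial sum per lane, then combine the lanes at the end.

    This mimics the order of additions in a vectorized sum reduction;
    the lane count is a hypothetical parameter for this sketch.
    """
    partial = [0.0] * lanes
    for i, v in enumerate(values):
        partial[i % lanes] += v
    total = 0.0
    for p in partial:
        total += p
    return total


# Same four numbers, two different orders of addition.
data = [1e16, 1.0, -1e16, 1.0]
print(sequential_sum(data))  # -> 1.0 (the first 1.0 is lost next to 1e16)
print(lane_sum(data))        # -> 2.0 (the two small terms meet in one lane)
```

Here the lane-wise order happens to give the exact answer, while the sequential order loses one of the small terms; with other data it can go the other way.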

oh_moose · ‎02-09-2008

Have you tried adding the -ftrapuv option yet? Steve wrote a while ago that this option does not work too well, but I would give it a bit more credit. It did spot an uninitialized integer variable in one of the legacy libraries which I have to use, so it is not completely useless, and it could also help you. With this option all variables are initialized with a constant value at the beginning of a routine. If you get repeatability with this option, then chances are that somewhere you have a problem with an uninitialized variable (I realize you claim you do not have that kind of problem). Unfortunately -ftrapuv causes other problems, so you should only use it for testing [1].

BTW using the qualifier -fpe0 in addition to -ftrapuv will not help you, because the Intel Pentium does not trigger an exception if an operand of a floating point operation is already an NaN (not a number). The exception is only being raised if the result is an NaN and all of the input operands were valid floating point numbers. Too bad. The Alpha and VAX processors do not have this restriction and therefore also catch a GIGO (garbage in garbage out) situation.
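
A small sketch of the "garbage in, garbage out" situation described above, in Python for illustration only. It assumes Python floats are IEEE 754 doubles (true on mainstream platforms) and uses a NaN as a stand-in for an uninitialized value. Python does not enable floating-point traps, so this shows only the silent propagation of a NaN operand, not the behavior of -fpe0 itself.

```python
import math

# Hypothetical stand-in for an uninitialized variable that holds a NaN.
garbage = float("nan")

# Ordinary arithmetic on a NaN operand raises nothing; the NaN just
# flows through to the result.
result = (garbage * 2.0 + 1.0) / 3.0

print(math.isnan(result))  # -> True
print(result == result)    # -> False (a NaN never compares equal to itself)
```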

[1] TRACEBACKQQ, missing line information; crash
http://software.intel.com/en-us/forums//topic/56846

jespersen · ‎02-12-2008

Thanks to those who responded; I tried some of the suggestions but with
no change in the outcome.
I did manage to produce a quite small test code that shows the problem.
I can't post the code here because it needs a data file (670KB). But if
anyone is interested in this send me e-mail and I will send you the code and
the data file.

Here is the README from my little test code:
Illustrating non-repeatability: how to run a program two times and get different results.

Required files:
Source code: driver.F
Data input file: flowvars.dat (this is little-endian)

Compile via
ifort -O3 -pad -align -auto -xT -o driver driver.F

Just run driver; it reads flowvars.dat, does some computation, and
writes a file vflj.dat.

Run driver several times, doing cksum on vflj.dat each time. You probably will
find different checksums. So the code can get different results if you run it
multiple times. You do not necessarily get a different result every time you run
the code, but if you run the code 8 times (say) you might get 3 different checksums.
The little script dodriver runs the code 8 times and computes cksum each time.

If you add the "-fp-model precise" option, i.e.,
ifort -O3 -pad -align -auto -xT -fp-model precise -o driver driver.F
then you should get exactly the same file vflj.dat every time you run driver.

As a further point of interest, if you comment out the body of the
"ELSE" clause in subroutine vflj (roughly lines 202 through 271) then
the code gives repeatable results, even though this block of code
is not executed since the input variable VISCK to the subroutine is .TRUE.
It seems that commenting out code that is not executed can change
the output. Strange...

Further notes:
compiling with "ifort -O3 -xT -o driver driver.F"
gives nonrepeatability.
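
The dodriver script itself is not shown in this thread. As a rough sketch of the same idea, the Python below (Python is used only for illustration) runs a command several times and counts how many distinct output files it produces. The function name count_distinct_outputs is hypothetical, and the sketch uses SHA-256 from hashlib in place of cksum.

```python
import hashlib
import subprocess
from collections import Counter
from pathlib import Path


def count_distinct_outputs(cmd, output_path, runs=8):
    """Run cmd `runs` times and tally a digest of output_path after each run.

    Returns a Counter mapping each distinct digest to the number of runs
    that produced it. SHA-256 is used here instead of cksum.
    """
    digests = Counter()
    for _ in range(runs):
        subprocess.run(cmd, check=True)  # stop if the program fails
        data = Path(output_path).read_bytes()
        digests[hashlib.sha256(data).hexdigest()] += 1
    return digests
```

For the test case above, a call such as count_distinct_outputs(["./driver"], "vflj.dat") would return more than one key whenever the runs disagree.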

Ron_Green · ‎02-12-2008

Well I am curious, but before I dive in I suspect we need to set expectations. I hope I don't come off as a know-it-all but ...

Optimization and numerical accuracy are at odds with each other. For some codes, not a problem, others are numerically sensitive.

The -O and -x options affect optimization. That your code gives different results at optimization AND may have small differences from run to run doesn't surprise me. Does it surprise you? I see you use -align, but remember, with no arguments this only affects common blocks and structures and not dynamic data or array temporaries.

The -fp-model option is used to control numerics. So if your code is sensitive, use it. Try the arguments that cost less performance, such as -fp-model source, until you balance performance and numerical accuracy.

Given all the above, what is your objective for this code? Do you want best performance, or best accuracy? OR would you like best performance possible with reproducibility? We can find the sweet spot once we know what it is you're looking for.

ron

jespersen · ‎02-13-2008

"The -O and -x options affect optimization. That your code gives different results at optimization AND may have small differences from run to run doesn't surprise me. Does it surprise you?"

Yes it surprises me that I can run the same code twice and get (slightly)
different answers. I am not comparing one machine to another, I am not
comparing different compiler options, rather the machine and compiler
options are fixed. Shouldn't the output stay the same from run to run?

By the way, some of the compiler options are relics from the options for
the whole code. In this small test case, just "ifort -O3" is enough to
demonstrate nonrepeatability. Also, the data file compresses nicely
and a gzipped tar file of everything is 57KB, in case anyone would like
to get this by e-mail.

Ron_Green · ‎02-13-2008

OK, I'll contact you and look into your code. We'll see what it is that is causing the non-repeatability.

ron

TimP · ‎02-13-2008

Original poster has continued to decline to state whether the 64-bit ifort is in use. If so, it would be of interest to other customers to root out these variations. If the 32-bit system is in use, there is not much hope, in view of the looser alignments.

jespersen · ‎02-13-2008

Sorry, I don't know exactly what you mean by
"whether the 64-bit ifort is in use".

"ifort -V" returns
Intel Fortran Compiler for applications running on Intel 64, Version 10.1 Build 20071116 Package ID: l_fc_p_10.1.011

Does that mean it's the 64-bit ifort?

Ron_Green · ‎02-13-2008

that's the 64 bit compiler.

I have been trying to contact you via this 'contact' 'email' capability on this forum, but no luck. Seems it's down at the moment (problem report filed).

Do you want to try to upload the tarball and gzipped input file?

ron

jespersen · ‎02-13-2008

I can certainly try to upload the tarball, but I don't
see any way to do it via the forum.

Or I can send it to you in an e-mail
(you can contact me directly at Dennis.Jespersen@nasa.gov)

dkokron · ‎02-13-2008

I've seen non-repeatability when using the -xP option under the following compiler.

Intel Fortran Compiler for Intel EM64T-based applications, Version 9.1 Build 20060707 Package ID: l_fc_c_9.1.036

Tom Clune from Goddard's NCCS tracked this to some array alignment problem. Here is his reproducer.

cat tester.F90

! Demonstrates problem in the Intel compilers -xP flag.
! Different results depending on location on cache line.

program vectorProblem
  implicit none

  integer, parameter :: n = 100000
  integer, parameter :: pad = 63
  real(kind=4), allocatable :: x(:), y(:)
  real(kind=4) :: s
  integer :: i, offset

  allocate(x(n+pad), y(n))
  call random_number(y)
  y = y - sum(y)/n   ! shift the data so its mean is close to zero

  ! Copy the same data to each starting offset in x and sum it there.
  do offset = 0, pad
    x(offset+1:offset+n) = y   ! same slice as the one summed below
    print*, offset, sum(x(offset+1:offset+n))
  end do

end program vectorProblem

Some description from Tom

Note that I'm not getting variation from run to run, but rather have
arranged to perform the same sum with arrays that start at different
locations in cache. If you take out the -xP, each offset gets the
same answer.

Typical output without -xP
0 0.6851280
1 0.6851280
2 0.6851280
3 0.6851280
4 0.6851280

typical output with -xP

0 0.4706955
1 0.4706761
2 0.4706561
3 0.4706792
4 0.4706955

Dan

jespersen · ‎02-14-2008

Ron Green of Intel sent e-mail which I believe explains everything.
He said

... typically this is due to alignment of data.
What happens is that the SSE vectorization must have 16-byte alignment.
So if your arrays start at an exact boundary and end at an exact
16-byte boundary, the entire vector/matrix is processed in the SSE
hardware. If the array starts before a 16-byte boundary or ends after
a boundary, those elements are processed serially in the X87 hardware.
There are slightly different FP semantics between the two. The crux
of the matter is that dynamic allocated data structures are not
guaranteed to be on 16-byte boundaries. Thus, from run to run your
dynamic data will have differing alignments, hence different FP
hardware handling the data preceding and following your 16-byte
aligned vectorized data structures.

Very clear. Thanks, Ron.
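
The mechanism described above can be imitated in a few lines. The following Python sketch (Python is used only for illustration) sums the same values with a varying number of leading elements handled one at a time before a lane-wise main loop, with any leftovers handled one at a time afterwards. The function name simd_style_sum, the peel parameter, the two-lane width, and the data are hypothetical; a real compiler may order the partial sums differently.

```python
def simd_style_sum(values, peel, lanes=2):
    """Sum values as: scalar peel loop, lane-wise body, scalar remainder.

    `peel` models how many leading elements sit before the first
    aligned boundary; it is an assumption of this sketch.
    """
    n = len(values)
    peel = min(peel, n)
    body_len = (n - peel) // lanes * lanes  # whole vectors only

    total = 0.0
    # Peel loop: unaligned leading elements, added one at a time.
    for v in values[:peel]:
        total += v

    # Vector body: each lane accumulates its own partial sum.
    partial = [0.0] * lanes
    for i in range(peel, peel + body_len):
        partial[(i - peel) % lanes] += values[i]
    for p in partial:
        total += p

    # Remainder loop: trailing elements that do not fill a vector.
    for v in values[peel + body_len:]:
        total += v
    return total


data = [1e16, 1.0, -1e16, 1.0]
for peel in range(2):
    print(peel, simd_style_sum(data, peel))
# prints:
# 0 2.0
# 1 1.0
```

If the starting address of an array changes from run to run, the effective peel count changes with it, which is enough to change the result even though the data and the executable are identical.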

TimP · ‎02-14-2008

A portable method for getting 16-byte aligned data is needed, and we've been told it was the default for the 64-bit platform. Clearly, it's not the default for 32-bit OS.
Fixing this will not make the serial result the same as the vector result, but it would avoid the non-repeatability.