Intel® Fortran Compiler

App with SSE2 enabled creates unstable results from time to time

a_zhaogtisoft_com
I have been trying to figure out what is really wrong with my code or my compiler settings.

More than a year ago, when I tried to convert our company's dev tool to ifort 11.1 with SSE2, I got a big surprise: the QA engineers reported that testing on Windows showed many simulation models exhibiting inconsistent results. That is, when the same model was run with the same executable multiple times, say 6 times in a row, one of the runs might produce a result different from the rest. This caught me completely off guard, because I had not run into the issue when doing my debugging/profiling with the same compiler version on Linux. We found the issue too close to our software release date, so eventually we had to disable SSE2 completely for the release (/arch:ia32).

After that, I put in more code-policing effort, trying to eliminate RUI/RUA errors (reads from uninitialized or unallocated memory) using Sun Studio's dbx memory checking. The effort greatly reduced the number of such unstable runs, but there are still some cases where Sun Studio's dbx finds nothing wrong.

Right now, I am trying to push the company to adopt ifort v12 with SSE2, but I have run into the same issue once again:

1) On Linux, the same compiler works fine; no unstable results were found anywhere in our testing suite. But on Windows, the same code shows result changes in 30+ of our test models.

ifort options for Linux/x86:
-extend_source 132 -r8 -assume underscore -fPIC -align dcommon -fpp
ifort options for Windows/x86:
/nologo /assume:buffered_io /fpp /real_size:64 /align:rec8byte /align:dcommons /align:sequence /iface:cvf /libs:static /threads

2) On the other hand, the Windows debug build (/debug:full /Od /arch:SSE2) shows no unstable results.

3) The Windows release build with the extra /Qsave compiler option still has the same instability.

4) With the extra compiler option /fp:precise, the Windows release build produces stable results, but it is 10% slower than the solver built with /arch:ia32, so this is not even an option for me.

For many of the models that show unstable results, I actually built a solver under Sun/SPARC so that I could use Sun Studio's dbx for memory checking; no RUI/RUA errors have been detected so far.

So it has been very depressing throughout this process. I reported the issue to Intel Premier Support, but was asked for a simple test case. Well, our code is huge and very complicated; if we cannot pinpoint such a transient instability, a test case is simply out of the question. Intel support did suggest turning on error checking, which I did, but nothing useful was reported.

So, any suggestions as to how to proceed?
8 Replies
TimP
Honored Contributor III
You ought to be able to set /assume:protect_parens in addition to your preferred options in case that improves stability.
According to what you said, I would guess you are running the ia32 compiler, possibly not the latest OS and Visual Studio, and are running into the data alignment dependence of vectorized sum or dot product reduction. In that case, intel64/X64 mode ought to improve on the situation. Otherwise, you are stuck with attempting to get consistent run-time alignment of the critical arrays. If improvements in the latest Microsoft components don't help, you might have to check alignment at run time and adjust out the variations.
a_zhaogtisoft_com
Tim,

Your guess is right: at the moment we only use the ia32 compiler, and I am not sure we want to support a native 64-bit application at all unless we see a big jump in runtime performance. My own build machine is Windows Vista (32-bit) with Visual Studio.

Are you suggesting that the build environment (OS, VS, etc.) could have some impact on runtime behavior? I am confused; I would think the OS would have more impact on run-time alignment than the build environment. Actually, I am more mystified by the fact that the Linux 32-bit build runs perfectly stably while only the Windows release build does not.

Could you tell me more about your last statement (how to check alignment at run time and adjust out the variations)? That sounds intriguing. Thanks.

Allen
jimdempseyatthecove
Honored Contributor III
Allen,

The linker is responsible for aligning segments to the specified alignment interval (assuming you've specified alignments). This affects the alignment of static structures (COMMONs and module data). Fortran calls upon the C runtime library. If your code has !DEC$...ALIGN..., the compiler should call the aligned malloc for heap allocations, and for the stack (local variables) the alignment involves two assembly instructions that insert padding when necessary. To get the alignment correct:

a) the compiler has to be working correctly
b) the C runtime library has to be working correctly
c) the linker has to be working correctly
d) your code must attribute the data to get the desired alignment
e) your code must not tell the compiler that data is aligned when in fact it has not been declared aligned.

If you continue to observe what you suspect are alignment issues, then either one of the above is not being satisfied, or one of your runtime libraries assumes one of the above when it does not actually hold, or you have a compiler bug.
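
For illustration, here is a minimal sketch of attributing data for alignment with the directive mentioned above; the module, array, and routine names are invented, not taken from your code:

    ! Sketch only: request 16-byte alignment for a module (static) array and
    ! a local array, so vector loops over them start on an aligned address.
    module work_data
      implicit none
      real(8) :: a(1024) = 0.0d0
      !DEC$ ATTRIBUTES ALIGN : 16 :: a
    end module work_data

    subroutine sum_first_n(n, s)
      use work_data
      implicit none
      integer, intent(in)  :: n
      real(8), intent(out) :: s
      real(8) :: tmp(1024)
      !DEC$ ATTRIBUTES ALIGN : 16 :: tmp   ! local (stack) array, padded if needed
      integer :: i
      tmp(1:n) = a(1:n)
      s = 0.0d0
      do i = 1, n
         s = s + tmp(i)
      end do
    end subroutine sum_first_n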

BTW, is your program multi-threaded? If so, the use or non-use of SSE will affect the timing of the different threads; changing the timing may expose a defect in the code that is otherwise not observed.

RE: 32-bit vs 64-bit

The performance depends on your application. 32-bit generates smaller code and puts less demand on the cache for reading/writing array descriptors and for stack accesses; the smaller size improves the instruction cache hit ratio as well as L1/L2 behavior for stack variables. However, on the 64-bit side you have twice the number of registers (both FP and integer), so your subroutines can speed up because more registers are available. You will have to run benchmarks to see what happens.

Jim Dempsey
TimP
Honored Contributor III
Until very recently, 32-bit Windows assured only 4-byte alignments. Thus, it was possible to have as many as 4 numerically different results from a vector sum reduction, where the loop is adjusted at run time by starting out with 0 to 3 scalar iterations (single precision) so as to reach a point where the operand is 16-byte aligned. In double precision, there would be just 2 possible results, most likely varying only in the 15th decimal. 64-bit Windows always gave 16-byte alignments, which should eliminate this source of numerical variation (unless, possibly, you run AVX code). I saw an announcement that VS2010 might give higher default alignments, even in the 32-bit run-time, but I've seen no report verifying this. Hence the compiler treatment that turns off vectorization of those loops when /fp:source and the like are set.
Ideally, ALLOCATABLE arrays would be set up so as to give 16-byte alignments at run time. That would be the simplest remedy, if it works, and even that is more effort than many people are willing to make. You can check alignment of an array at run time by the standard C_LOC or legacy LOC intrinsics; if necessary, you could copy an array so as to align it, but you wouldn't do that in an inner loop. For several years, ifort has aligned local arrays which are big enough to be vectorization candidates (at the possible expense of some wasted storage for each array).
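To make the run-time check concrete, here is a small sketch using C_LOC (the array name x and its size are just placeholders); the legacy LOC intrinsic could be used the same way:

    program check_alignment
      use iso_c_binding, only: c_loc, c_intptr_t
      implicit none
      real(8), allocatable, target :: x(:)
      integer(c_intptr_t) :: addr
      allocate(x(1000))
      ! move the C address into an integer so the low-order bits can be tested
      addr = transfer(c_loc(x(1)), addr)
      if (mod(addr, 16_c_intptr_t) == 0) then
         print *, 'x(1) is 16-byte aligned'
      else
         print *, 'x(1) is NOT 16-byte aligned; offset =', mod(addr, 16_c_intptr_t)
      end if
    end program check_alignment

If the reported offset varies from run to run, that would line up with the run-to-run result variation described above.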
If the results differed more than expected with changes in alignment, it might be taken as a warning of numerical instability.
jimdempseyatthecove
Honored Contributor III
TimP,

Thanks for the assessment.

Isn't this then a problem of the IVF runtime library assuming alignments that are not under the app's control?
IOW, shouldn't the code generate two paths (with a runtime test)? Even with aligned allocations, a programmer may need to call a function with a slice of an array (which is not aligned).

Jim Dempsey
Steven_L_Intel1
Employee
It does generate two code paths, and that's the problem. Results can vary a bit depending on which slice of the input gets vectorized and which is done one at a time.
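
A contrived sketch of why that matters numerically: the two groupings below stand in for "peel one scalar iteration, then vector sum" versus "vector sum from an aligned start" (they are illustrative only, not actual compiler output), and they round differently when parentheses are honored (e.g. with /assume:protect_parens or /fp:source):

    program peel_demo
      implicit none
      real(8) :: a(4) = (/ 1.0d0, 1.0d-16, 1.0d-16, 0.0d0 /)
      real(8) :: s_aligned, s_peeled
      ! grouping that mimics a vector sum starting on an aligned element
      s_aligned = (a(1) + a(3)) + (a(2) + a(4))
      ! grouping that mimics peeling one scalar iteration before the vector sum
      s_peeled  = a(1) + ((a(2) + a(4)) + a(3))
      print *, s_aligned   ! typically prints 1.0000000000000000
      print *, s_peeled    ! typically prints 1.0000000000000002
    end program peel_demo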
a_zhaogtisoft_com
I have been busy digesting all I have heard here. This is what I have done:

In my Fortran/C project (SSE2 enabled), I turned on two more switches: /warn:all and /check:all, one for compile time and one for runtime. It turns out that a great many errors are reported, presumably by /warn:all. Some typical errors follow:

error #6634: The shape matching rules of actual arguments and dummy arguments have been violated. [CBLTPRP]
error #7836: If the actual argument is scalar, the corresponding dummy argument shall be scalar unless the actual argument is an element of an array that is not an assumed-shape or pointer array, or a substring of such an element. [XX]
error #8000: There is a conflict between local interface block and external interface block.
error #8284: A scalar actual argument must be passed to a scalar dummy argument unless the actual argument is of type character or is an element of an array that is neither assumed shape nor pointer. [X1]
error #5508: Declaration of routine 'RTIME_S' conflicts with a previous declaration
I have been chasing developers about these error reports, and a solver build is not even possible at the moment with /warn:all. (With /warn:custom the build is fine and working.)

For some of the errors, the developer told me they are perfectly legal. For example, in a subroutine argument he declared
dimension X(*)
but he did not pass an array into this subroutine; instead he used a scalar. I understand that Fortran passes everything by reference, and if the user only cares about the first element of the array, passing a scalar where an array is declared may be OK.
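
For reference, here is a stripped-down sketch of that pattern (the names are made up, not from our code); it is exactly the kind of call that /warn:all flags with error #8284:

    subroutine set_first(x, val)
      implicit none
      double precision :: x(*)          ! assumed-size dummy
      double precision :: val
      x(1) = val                        ! only the first element is ever touched
    end subroutine set_first

    program caller
      implicit none
      double precision :: s
      external set_first
      s = 0.0d0
      ! scalar actual passed to an array dummy: non-standard, but works with
      ! a by-reference calling convention as long as only X(1) is referenced
      call set_first(s, 2.5d0)
      print *, s
    end program caller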

Here is my question: will a case like the above contribute to the alignment issues you are discussing?

I ported our VS 2005 projects to VS 2010, hoping that the latest linker could help. Here is the preliminary result: we still see some cases of result inconsistency, though the tester told me it looks like the total case count is decreasing. He thinks the VS 2010 linker probably does not do much, but the code cleanup I forced the developers to go through may have helped.

Anyway, I just wanted to give a heads-up.

Some answers to other questions:

1) Our application is really a single-threaded application, especially in the current context.

2) I do not think we purposely align anything. This is a Fortran developer house (though I am a C/C++ programmer who knows a lot of F90), and developers just write it whatever way works. I do not see anything such as !DEC$...ALIGN; the only !DEC$ directives I have seen are those marking subroutines to be exported.
TimP
Honored Contributor III


(quoting the previous post:) "For some of the errors, the developer told me they are perfectly legal. For example, in a subroutine argument he declared dimension X(*), but he did not pass an array into this subroutine; instead he used a scalar. I understand that Fortran passes everything by reference, and if the user only cares about the first element of the array, passing a scalar where an array is declared may be OK. Here is my question: will a case like the above contribute to the alignment issues you are discussing?"


This one has been discussed at more length previously on these forums. It works OK on ifort, but may fail on other platforms (as it would have on the platforms I learned on many years ago). It won't affect your numerical results with ifort, provided that you ensure separately that the program accesses at most the single array element. With only 1 element, you won't reach the code which is sensitive to alignment. If you are picking up more than one element when your caller has defined only one, that would be a source of non-repeatability.
Speaking editorially, if you are concerned about exact numerical repeatability, it's a good idea to ensure as much as possible that your code is clean according to standards.