Intel® Fortran Compiler

Different Answer on the Same Machine

rasa
Beginner
I am running a large program on the Windows XP x64 platform. The machine has two dual-core Xeon processors.

I am using the following compilation flags.

/nologo /O3 /Og /Qparallel /Qpar_threshold:20 /assume:buffered_io /module:"Release\" /object:"Release\" /libs:static /threads /c /QaxTPNS

What I have observed is that when I run the exe with the same input files, I get slightly different answers (usually in the last significant digit of the printout -- something like 1.0E-06). This shows up when I do a diff on the output files. I would like to avoid this behavior; I need results that are repeatable (if possible, across different hardware).

I am guessing that the compiler is doing aggressive optimization. Is there any specific flag that I should add or omit to prevent this?
3 Replies
TimP
Honored Contributor III
Quoting - ragu
If you are compiling for 32-bit mode on a 64-bit OS, try the 64-bit compiler. That should remove many variations associated with data alignment. I think the next most likely option to add would be /fp:source, to turn off those vectorizations which have data-alignment dependencies. You might try raising the par_threshold value, at least until performance begins to drop; there's no guessing about it, par_threshold:20 is extremely aggressive. Cut back on the number of different CPU types you request code paths for, noting that the oldest option for Intel 64-bit is P (not N), that T seldom adds anything useful, and that each additional option adds potential variations among differing hardware. If you put in P to cater for AMD hardware, but your AMDs are new enough for the O option, use that rather than P.
Don't neglect to set an appropriate value for affinity, such as
SET KMP_AFFINITY=compact
so as to get more consistent results, both in performance and data alignment.
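Putting those suggestions together, an illustrative set of flags might look like the line below. The threshold value of 80 and the single P code path are examples to tune against your own hardware and timings, not prescriptions, and newer compiler versions spell the /Qax targets differently:

ifort /nologo /O3 /Og /Qparallel /Qpar_threshold:80 /fp:source /assume:buffered_io /module:"Release\" /object:"Release\" /libs:static /threads /c /QaxP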
gib
New Contributor II
Quoting - ragu
Do you get a variation in results when you restrict the program to execution on a single CPU?
A variation in output with the same input suggests an uninitialised variable (which will tend to get different starting values each time), but to produce the small change you are seeing, the variable would have to be one with little influence on the results. It would be worth turning on the compiler's checks for uninitialised variables just to be safe.
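With ifort, those diagnostics can be enabled in a debug build, for example (illustrative; check your compiler version's documentation for the exact spellings):

ifort /nologo /Od /Zi /check:uninit /Qtrapuv /traceback /threads /c

/check:uninit adds run-time checks for use of uninitialised variables, and /Qtrapuv initialises stack locals to an unusual value so that accidental use tends to show up quickly.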
jimdempseyatthecove
Honored Contributor III

My guess is that /Qparallel is parallelizing a section of code that is sensitive to order of execution, so your round-off errors are occurring in different places.

By examining which results differ you may be able to determine which code is sensitive to processing order. A simple correction would be to turn off /Qparallel for those routines.
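As a sketch (assuming you want to keep /Qparallel for the rest of the build), the auto-parallelizer can be told to leave an individual loop alone with a directive, or you can simply compile the sensitive source files without /Qparallel. The routine below is only an illustration:

subroutine serial_sum(a, n, total)
   implicit none
   integer, intent(in)  :: n
   real(8), intent(in)  :: a(n)
   real(8), intent(out) :: total
   integer :: i
   total = 0.0d0
!DIR$ NOPARALLEL
   do i = 1, n
      total = total + a(i)   ! order-sensitive reduction, kept in serial order
   end do
end subroutine serial_sum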

However, it may be beneficial to look at your code to determine whether you can make it less sensitive to sequencing, and thus make better use of all your cores.

As an example of sequencing issues:

Assume you have a large array of REAL(8)s.
Assume that the precision is taxed (the numbers stored in the array are approximate due to prior round-off).

If you were to sum this array using 1 thread, the summation of these approximate numbers would encounter further round-off errors at specific indexes into the summation. Some additions would be exact, while others would have round-off. However, if this single thread were re-run, the same sum would result, because the same errors occur in the same sequence during the summation.

Now, if you were to sum this array using 2 threads, one working on 1:N/2 and the other working on N/2+1:N, the first thread would encounter the same round-off errors at the same indexes up to its termination point of N/2. The second thread, though, starts out with a partial sum of 0.0 at N/2+1 and will thus encounter its own set of round-off errors at different indexes, and with different round-off results, from those of a single thread processing the whole loop. The sum of these two partial sums may indeed be different in the least significant bits.
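A minimal single-threaded sketch of this effect (the fill values are arbitrary; any data that taxes the 53-bit mantissa will do):

program partial_sum_demo
   implicit none
   integer, parameter :: n = 1000000
   real(8), allocatable :: a(:)
   real(8) :: s_serial, s1, s2, s_split
   integer :: i
   allocate(a(n))
   ! values whose running sum taxes the REAL(8) mantissa
   do i = 1, n
      a(i) = 1.0d0 + 1.0d-15*i
   end do
   ! one running sum: round-off occurs in one fixed sequence
   s_serial = 0.0d0
   do i = 1, n
      s_serial = s_serial + a(i)
   end do
   ! two partial sums, as two threads would produce, then combined
   s1 = 0.0d0
   do i = 1, n/2
      s1 = s1 + a(i)
   end do
   s2 = 0.0d0
   do i = n/2+1, n
      s2 = s2 + a(i)
   end do
   s_split = s1 + s2
   print '(a,es24.16)', ' serial sum  = ', s_serial
   print '(a,es24.16)', ' split sum   = ', s_split
   print '(a,es24.16)', ' difference  = ', s_split - s_serial
end program partial_sum_demo

The difference printed on the last line is typically nonzero in the last bits, even though both sums use exactly the same data.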

If you want consistency between running 1 thread and running n threads, then the code and data must be examined to see whether they are sensitive to being run in pieces as opposed to all at once. Where they are, consider adding code after the reduction (the same spot even with 1 thread) that rounds the result to a diminished but consistent precision, e.g. on a 16-core machine, consider rounding off the least significant 5 bits. You will need to test this.
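One simple way to implement that kind of trimming (this clears the low bits rather than rounding to nearest, which may be enough for comparing printouts; the function name is just an illustration) is to mask the REAL(8) bit pattern:

pure function trim_low_bits(x, nbits) result(y)
   implicit none
   real(8), intent(in) :: x
   integer, intent(in) :: nbits
   real(8) :: y
   integer(8) :: bits
   bits = transfer(x, bits)                    ! reinterpret the REAL(8) bit pattern
   bits = iand(bits, not(2_8**nbits - 1_8))    ! clear the low nbits of the mantissa
   y = transfer(bits, y)
end function trim_low_bits

Calling trim_low_bits(total, 5) after the reduction gives the same trimmed value whether the sum was accumulated by one thread or several, provided the untrimmed results differ only in those low bits.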

Jim Dempsey