I am running a large program on the Windows XP x64 platform. The machine has two dual-core Xeon processors.
I am using the following compilation flags:
/nologo /O3 /Og /Qparallel /Qpar_threshold:20 /assume:buffered_io /module:"Release\" /object:"Release\" /libs:static /threads /c /QaxTPNS
What I have observed is that when I run the exe with the same input files, I get slightly different answers, usually in the last significant digit of the printout (something like 1.0E-06); a diff on the output files shows this up. I would like to avoid this sort of behavior. I need results that are repeatable, if possible across different hardware.
I am guessing that the compiler is doing aggressive optimization. Is there a specific flag I should add or omit to prevent this?
3 Replies
Quoting - ragu
Don't neglect to set an appropriate value for affinity, such as
SET KMP_AFFINITY=compact
so as to get more consistent results, both in performance and data alignment.
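For example, from a Windows command prompt before launching the program (the executable name `myprog.exe` is a placeholder, and OMP_NUM_THREADS is optional; it just pins the thread count):

```shell
rem Bind OpenMP threads to cores so successive runs place threads,
rem and therefore data, the same way each time.
SET KMP_AFFINITY=compact
SET OMP_NUM_THREADS=4
myprog.exe
```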
Quoting - ragu
A variation in output with the same input suggests an uninitialised variable (which will tend to get different starting values each run), but to produce the small change you are seeing, the variable would have to be one with little influence on the results. It would be worth turning on the compiler's checking for uninitialised variables just to be safe.
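With the Intel Fortran compiler used here, that checking can be enabled roughly like this (a sketch; `myprog.f90` is a placeholder, and you should verify the exact option names against your compiler version's documentation):

```shell
rem Debug build with run-time checking of uninitialised variables.
rem /Qtrapuv additionally fills stack variables with an unusual bit
rem pattern, so use of an uninitialised value tends to fail loudly.
ifort /nologo /Od /check:uninit /Qtrapuv /traceback /c myprog.f90
```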
My guess is /Qparallel is parallelizing a section of code that is sensitive to order of execution. Your round-off errors are occurring in different places.
By examining which results differ, you may be able to determine which code is sensitive to processing order. A simple correction would be to turn off /Qparallel for those routines.
However, it may be beneficial to look at your code to determine whether you can make it less sensitive to sequencing, and thus make better use of all your cores.
As an example of sequencing issues:
Assume you have a large array of REAL(8) values.
Assume that the precision is taxed (the numbers stored in the array are approximate due to prior bit round-offs).
If you were to sum this array using one thread, the summation of these approximate numbers would encounter further round-off errors at specific indexes into the summation. Some additions would be exact, while others would have round-offs. However, if this single thread were re-run, the same sum would result, because the same errors occur in the same sequence during the summation.
Now, if you were to sum this array using two threads, one working on 1:N/2 and the other working on N/2+1:N, the first thread would encounter the same round-off errors at the same indexes up to its termination point of N/2. The second thread, though, starts out with a partial sum of 0.0 at N/2+1, and will thus encounter its own set of round-off errors, at different indexes and with different round-off results from those of a single thread processing the whole loop. The sum of these two partial sums may indeed differ in the least significant bits.
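The effect is easy to reproduce in any language. A minimal sketch (Python here for brevity; the same arithmetic applies to a Fortran sum that /Qparallel splits across threads):

```python
import random

# An array whose sums tax double precision (stand-in for the REAL(8)
# array described above).
random.seed(12345)
data = [random.uniform(-1.0e6, 1.0e6) for _ in range(100_000)]

# One thread: a single running total; round-off happens in one fixed order.
serial = 0.0
for x in data:
    serial += x

# Two threads: each half gets its own partial sum starting from 0.0, so
# round-off happens at different points; the combined result can differ
# from the serial sum in the last bits.
half = len(data) // 2
p1 = 0.0
for x in data[:half]:
    p1 += x
p2 = 0.0
for x in data[half:]:
    p2 += x
parallel = p1 + p2

# The two totals agree to many digits but often not in the last bits.
print(serial, parallel, serial - parallel)
```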
If you want consistency between running one thread and running n threads, then the code and data must be examined to see whether they are sensitive to being run in pieces as opposed to all at once. Where they are, consider adding code after the reduction (the same spot even with one thread) that produces diminished but consistent precision; e.g. on a 16-core machine, consider rounding off the least significant 5 bits. You will need to test this.
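One way to implement that bit round-off (a sketch in Python; `round_low_bits` is a hypothetical helper name, and the bit count must be tuned to your thread count and algorithm):

```python
import math
import struct

def round_low_bits(x: float, nbits: int) -> float:
    """Zero the low nbits of an IEEE-754 double's mantissa, so two
    results that differ only in those bits compare equal afterwards."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    mask = ~((1 << nbits) - 1) & 0xFFFFFFFFFFFFFFFF
    (masked,) = struct.unpack("<d", struct.pack("<Q", bits & mask))
    return masked

# Two sums that differ by one ulp collapse to the same value here
# (note: a carry out of the masked bits can still defeat this).
a = 0.1
b = math.nextafter(0.1, 1.0)  # 0.1 plus one ulp
print(round_low_bits(a, 5) == round_low_bits(b, 5))
```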
Jim Dempsey
