We're trying to move a stochastic Fortran program to OpenMP with XE 2013 on Windows 7 using Visual Studio. Basically, we want to run many copies of the program after the initial read-in of tables, while sharing a couple of the large (basically read-only) tables between the threads. In the simplest configuration, two large do loops, with their subroutines and modules, are completely enclosed in a single parallel region in which everything is firstprivate except for a couple of shared arrays. In this version, once the threads enter the parallel region, they never leave it. We've tried more complicated uses of OpenMP but, in all cases, we get only modest improvements, i.e., a number of threads running but very little improvement in total work accomplished per unit of wall time compared to a single thread.
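For concreteness, the simplest configuration described above might look roughly like the sketch below. This is only an illustration under stated assumptions: every name in it (`big_table`, `acc`, `state`, the loop bounds, the toy random-number recurrence) is hypothetical and stands in for the real tables and work, and it assumes the code is compiled with OpenMP enabled (`/Qopenmp` with Intel Fortran on Windows).

```fortran
! Hypothetical sketch: none of these names come from the real program.
program parallel_copies_sketch
  use omp_lib
  implicit none
  integer, parameter :: i8 = selected_int_kind(18)
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n_table = 1000000            ! assumed table size
  integer, parameter :: n_outer = 100, n_inner = 100000
  real(dp), allocatable :: big_table(:)              ! large, read-only once filled
  real(dp) :: acc, t0, t1
  integer(i8) :: state
  integer :: i, j, idx, tid

  ! Serial setup, standing in for the initial read-in of tables.
  allocate(big_table(n_table))
  do i = 1, n_table
     big_table(i) = real(i, dp) / real(n_table, dp)
  end do

  acc = 0.0_dp
  state = 12345_i8

  t0 = omp_get_wtime()
  ! One parallel region: the big table is shared, and each thread gets its
  ! own initialized copies of acc and state via firstprivate.
  !$omp parallel default(none) shared(big_table) &
  !$omp firstprivate(acc, state) private(i, j, idx, tid)
  tid = omp_get_thread_num()
  state = state + 7919_i8 * int(tid, i8)             ! distinct stream per thread
  do i = 1, n_outer
     do j = 1, n_inner
        ! Toy linear congruential step; stays within 64-bit integer range.
        state = mod(state * 1103515245_i8 + 12345_i8, 2147483648_i8)
        idx = 1 + int(mod(state, int(n_table, i8)))
        acc = acc + big_table(idx)                   ! read-only access to shared data
     end do
  end do
  !$omp critical
  print '(a,i0,a,f14.4)', 'thread ', tid, ' accumulated ', acc
  !$omp end critical
  !$omp end parallel
  t1 = omp_get_wtime()

  print '(a,f10.4,a)', 'wall time: ', t1 - t0, ' s'
end program parallel_copies_sketch
```

Each thread here runs a full copy of the work, so with ideal scaling the wall time would stay roughly constant as the thread count (set, for example, through `OMP_NUM_THREADS`) increases.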
Finally, we're getting substantial improvement (e.g., four threads' worth of work in only twice the wall time of one thread). HOWEVER, it only occurs if the code is run under VTune's Locks and Waits analysis (x64, in either debug or release mode). If, immediately after running in this mode, we run the same program (Ctrl+F5) without VTune, it reverts to very little gain (i.e., the same amount of work in the same amount of wall time, no matter how many threads are running).
Here's the question: What is VTune Locks and Waits doing that is speeding things up?
We judge success by the quantity and quality of the output, so nothing is being skipped in the execution. Also, Locks and Waits is not completely consistent; sometimes it, too, runs slowly. Whichever mode (fast or slow) a run starts in, it stays in that mode indefinitely. This has been observed on both i7 and E5 machines. The order of compiling and of running with and without Locks and Waits may matter, but we have not been able to pin down any consistent behavior in that regard.
We're hoping that whatever Locks and Waits has discovered, we can use to achieve the speed ups we need to move this to a MIC.
Bruce Weaver wrote: I think that you are using Intel(R) Parallel Studio XE 2013 SP1, but the performance is poor when building for an old Xeon. Maybe you missed some advanced compiler options, but I don't know. My suggestion is to submit this problem to the Intel(R) C++ Compiler forum (http://software.intel.com/en-us/forums/intel-c-compiler) for help.
OK, I upgraded from the initial 2013 release to the second (119). The code now works correctly on a 3rd-gen i7 but, when I move the executable (either debug or release) without recompiling to an E5 or an older Xeon, it runs even slower than before. Recompiling on the E5 with the latest compiler and VTune gives the same results as before: very poor performance except, sporadically, when run with Locks and Waits. I don't see how we can have any confidence in moving this to some MICs until we straighten this out. What would you suggest? We're pretty much running out of ideas here.