
What is the secret of Locks and Waits?

Bruce_Weaver
Beginner

Hi,

We're trying to move a stochastic Fortran program to OpenMP with XE 2013 on Windows 7 using Visual Studio.  Basically, we want to run many copies of the program, after an initial read-in of tables, while sharing a couple of the large (basically read-only) tables between the threads.  In the simplest configuration, two large do loops, with their subroutines and modules, are completely enclosed in a single parallel region, firstprivate except for a couple of shared arrays.  In this version, after entering the parallel region, the threads never leave it.  We've tried more complicated uses of OpenMP but, in all cases, we get only modest improvements, i.e., a number of threads running but very little improvement in total accomplishment in wall time compared to only one thread.
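
Schematically, the code looks something like the sketch below (a simplified sketch only; the names, dimensions, and trial count are placeholders, not our real code):

program omp_sketch
  implicit none
  integer, parameter :: ntrial = 10000
  real(8), allocatable :: big_table(:,:,:)   ! large, essentially read-only table
  real(8) :: results(10000, 20)              ! modest shared output array
  integer :: i

  allocate(big_table(1000, 600, 5))
  call random_number(big_table)              ! stands in for the initial table read-in
  results = 0.0d0

  ! One parallel region encloses all the work; the big table is firstprivate,
  ! the output array is shared, and the threads never leave the region.
  !$omp parallel default(none) firstprivate(big_table) shared(results) private(i)
  !$omp do
  do i = 1, ntrial
     call run_one_trial(big_table, results, i)
  end do
  !$omp end do
  !$omp end parallel

  print *, 'checksum =', sum(results)

contains

  subroutine run_one_trial(table, out, trial)   ! stands in for the real do-loop work
    real(8), intent(in)    :: table(:,:,:)
    real(8), intent(inout) :: out(:,:)
    integer, intent(in)    :: trial
    out(trial, 1) = out(trial, 1) + sum(table(:, 1, 1))
  end subroutine run_one_trial

end program omp_sketch

The real loop bodies are of course much larger and call into several subroutines and modules, but the data sharing is as shown.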

Finally, we're getting substantial improvement (e.g., four threads' worth of work in only twice the wall time of one thread).  HOWEVER, it only occurs if the code is run under VTune Locks and Waits (x64, in either debug or release mode).  If, immediately after running in this mode, we run the same program (Ctrl-F5) without VTune, it reverts to very little gain (i.e., the same amount of work in the same amount of wall time, no matter how many threads are running).

Here's the question: What is VTune Locks and Waits doing that is speeding things up?

We judge success by the quantity and quality of the output, so nothing is being missed in the execution.  Also, Locks and Waits is not completely consistent; sometimes it, too, runs slowly.  Whichever mode (fast or slow) it starts in, it continues in that mode indefinitely.  This has been observed on both i7 and E5 machines.  There may be an issue with the order of running with and without Locks and Waits and of compiling, but we have not been able to pin down any consistent behavior in that regard.

We're hoping that whatever Locks and Waits has discovered can be used to achieve the speedups we need to move this to a MIC.

thanks,

Bruce

Peter_W_Intel
Employee
The Locks and Waits analysis monitors all Windows-defined threading APIs and sync objects, and it supports Intel Threading Building Blocks and OpenMP programs.  Usually you look at the hotspots in the result: the sync objects that spent the most "wait time", together with their "wait count".  When a sync object has a large wait time, you need to know whether the other threads were working or idle during that time; if they were idle, that is where you need to change the code so the sync objects are no longer the bottleneck.

Second, you might reduce the "wait count" in the algorithm to reduce wait time.  There are other ways to reduce wait time as well.  One is to shrink the critical code region; another is not to use a single lock to protect a big piece of shared data, for example a big array of structures or a big structure of arrays.  In practice one thread only accesses a limited part of the big shared data, so other threads should be able to use the other elements; the developer can adjust the data structure accordingly.

Note that the Locks and Waits analysis has some overhead, since it records performance data when sync objects are executed.  So you have to compare VTune results between the program before a code change and the program after the code change.  It is meaningless to compare performance data of the program running without VTune against the program running under VTune.
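
As a generic illustration of the second point (not taken from your program; the names here are made up), shrink the protected region so that only the short shared update sits inside the critical section, not the expensive work:

program critical_sketch
  implicit none
  integer, parameter :: n = 100000
  real(8) :: total, trial_value
  integer :: i

  total = 0.0d0
  !$omp parallel do private(trial_value) shared(total)
  do i = 1, n
     trial_value = expensive_trial(i)   ! long computation runs outside any lock
     !$omp critical
     total = total + trial_value        ! only the short shared update is protected
     !$omp end critical
  end do
  !$omp end parallel do
  print *, 'total =', total

contains

  real(8) function expensive_trial(i)   ! placeholder for the real per-trial work
    integer, intent(in) :: i
    expensive_trial = sqrt(dble(i))
  end function expensive_trial

end program critical_sketch

In this particular pattern a reduction(+:total) clause would remove the critical section entirely; the point is only that the lock should cover as little code as possible.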
Bruce_Weaver
Beginner
Hi Peter,

I've been using Locks and Waits for a couple of weeks now and I think I understand its basic functions.  I'm running with four OMP threads in this case, and L&W shows five threads.  The three workers plus the startup thread are solid CPU, and the two summary histograms are solid at four running threads/logical CPUs.  The _kmp_launch_monitor thread is solid wait, but I guess that is what it does.  There are no transitions other than the ones at startup.  So everyone seems happy, except that adding more CPUs does very little good EXCEPT when run under L&W.

I would not think about comparing the efficiency of the program with and without L&W EXCEPT for the fact that, when running under L&W, the program is about 1.7 times more efficient than when it is run without L&W.  I don't think that is meaningless when it means I can get my results in three days rather than five.  L&W is doing something that lets the program make the gains in efficiency that I expected when we started the conversion.  It must be doing something differently, albeit sporadically.  I'm hoping this quirk will give us some guidance on discovering why, in general, we are not able to achieve any significant gains even though Hotspots, L&W, and some of the other diagnostic tools are telling us that the threads are neither spinning nor waiting, yet we are not gaining any efficiency (throughput/wall time).

At the moment, in an effort to debug this problem, the only array that is shared is the output array, which is a modest (10000,20) and takes only 0.1% of the CPU time for the threads to update.  There are no explicit locks, and the 'tasks' display shows no wait time except for _kmp_launch_monitor.  The larger array I hope to share in the future, (1000,600,5), is currently brought into each thread by firstprivate, just to keep things as simple as possible at this point.

BTW, this speedup only seems to occur when I have both the parallelization and OpenMP options checked.  Here is the command line that seems to make this peculiar situation work sometimes:

/nologo /debug:full /MP /O2 /QxHost /Qparallel /Qopenmp /Qopenmp-report1 /Qpar-report2 /Qvec-report1 /warn:none /module:"x64\Debug\\" /object:"x64\Debug\\" /Fd"x64\Debug\vc100.pdb" /traceback /check:bounds /libs:static /threads /dbglibs /Qmkl:parallel /c

Run straight, the timing of the code is consistent to about 2%.  What is it that L&W can do that, on some runs, speeds the code up 170%?  Whatever that is, it is an important clue to what we have to do to achieve the expected speedup of this code with OpenMP.
Bruce_Weaver
Beginner
I moved to eight threads.  With Locks and Waits, I got 8x the data in less than 3x the wall time.  Just running without VTune, I got 8x the data in 6x the wall time.  VTune is doing something very helpful.  Now I'm compiling with only OpenMP, not the parallelization option.
Peter_W_Intel
Employee
I'm not familiar with your program; what I posted last time were general considerations for reducing wait time and wait count.

Based on the info "...The _kmp_launch_monitor is solid wait but I guess that is what it does. There are no transitions other than the ones at startup...", here is something else you might try: set the environment variable

OMP_WAIT_POLICY=passive  (instead of the default, "active") to reduce spin time

-OR-

KMP_BLOCKTIME=20  (the default is 200 ms)

before running the program.
Bruce_Weaver
Beginner
I gather I set these environment variables with a SETENVQQ call.  They didn't help.  We've been trying to convert this program for a few weeks without much success, so it's starting to get a bit frustrating; we can do small programs exactly as expected.  However, the environment variables may be the issue.  Since VTune Locks and Waits gives a two-to-three-times improvement over running the code without VTune, I'm guessing now that VTune is setting some environment variables in order to accomplish its job.  Someone in your shop must be familiar enough with VTune to shed some light on this.  Is there a way to discover what the OMP and KMP environment variables are set to while VTune is running?

thanks
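
P.S. In case it helps, here is the quick check I can add at program startup and run both with and without VTune to compare (a simple sketch using the standard GET_ENVIRONMENT_VARIABLE intrinsic; the list of variable names is just my guess at the relevant ones):

program show_omp_env
  implicit none
  ! Print the OpenMP-related environment variables this process actually sees.
  character(len=16), parameter :: names(4) = &
       [character(len=16) :: 'OMP_NUM_THREADS', 'OMP_WAIT_POLICY', &
                             'KMP_BLOCKTIME', 'KMP_AFFINITY']
  character(len=64) :: val
  integer :: length, status, k

  do k = 1, size(names)
     call get_environment_variable(trim(names(k)), val, length, status)
     if (status == 0) then
        print *, trim(names(k)), ' = ', val(1:length)
     else
        print *, trim(names(k)), ' is not set'
     end if
  end do
end program show_omp_env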
Bruce_Weaver
Beginner
Nope.  I tried expanding OMP_STACKSIZE without success.  The speedup usually continues to appear under Locks and Waits, but never when the program is just executed, and apparently not under Hotspots either.  It is hard to debug something that works with the debug tools but not otherwise.  Usually when that happens, at least with debuggers, it is a mismatch in types or some such, but there is no indication of that here.
Bruce_Weaver
Beginner
OK, I upgraded from the initial 2013 release to the second (119).  The code now works correctly on a 3rd-gen i7, but when I move the executable (either debug or release) to an E5 or an older Xeon without recompiling, it runs even slower than before.  Recompiling on the E5 with the latest compiler and VTune gives the same results as before: very poor performance except, sporadically, when run with Locks and Waits.  I don't see how we're going to have any confidence in moving this to some MICs until we can straighten this out.  What ideas do you have?  We're pretty much running out of ideas here.
Peter_W_Intel
Employee
Bruce Weaver wrote:

OK, I upgraded from the initial 2013 release to the second (119).  The code now works correctly on a 3rd-gen i7, but when I move the executable (either debug or release) to an E5 or an older Xeon without recompiling, it runs even slower than before.  Recompiling on the E5 with the latest compiler and VTune gives the same results as before: very poor performance except, sporadically, when run with Locks and Waits.  I don't see how we're going to have any confidence in moving this to some MICs until we can straighten this out.  What ideas do you have?  We're pretty much running out of ideas here.

I think that you are using Intel(R) Parallel Studio XE 2013 SP1, but performance is poor when building for the older Xeon.  Maybe you are missing some advanced compiler options, but I don't know; my suggestion is to submit this problem to the Intel(R) C++ Compiler forum - http://software.intel.com/en-us/forums/intel-c-compiler - for help.
Mark_D_Intel
Employee
I suspect it's an issue with the memory layout of the arrays.  The larger array is 1000*600*5*8 bytes (double precision, I assume?), roughly 23 MB.  This will not even fit in L3 cache, and there's a copy in each thread.  Any idea of the access pattern of this array: sequential through the elements, random, etc.?

I suggest running 'Lightweight Hotspots' in VTune and looking for areas that might have an unusually high CPI (cycles per instruction).  This might give some clues as to which memory accesses are causing problems.

Running under Locks and Waits may adjust the relative locations of the thread stacks (and other memory locations), and this might affect the caching behavior of the arrays.  The '%LOC' intrinsic returns the address of an item; you might print it for some arrays and see whether it changes between bare runs and runs under Locks and Waits.

When setting environment variables that affect a program, it's usually best to set them outside the program (in Visual Studio, under Configuration Properties -> Debugging -> Environment) to ensure they are set from the beginning of the program run.
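
For example, a minimal sketch of that address check (using LOC, the Intel Fortran extension behind %LOC; the array names and shapes below are just placeholders for yours):

program print_addresses
  implicit none
  real(8), allocatable :: big_table(:,:,:)   ! placeholder for the large table
  real(8) :: results(10000, 20)              ! placeholder for the output array

  allocate(big_table(1000, 600, 5))
  results = 0.0d0

  ! Print these in a bare run and in a run under Locks and Waits, then compare.
  print '(a, z16.16)', ' address of big_table: ', loc(big_table)
  print '(a, z16.16)', ' address of results:   ', loc(results)
end program print_addresses

Printing loc() of a private variable from inside the parallel region might also show where each thread's stack ended up in each case.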