Intel® Fortran Compiler
Build applications that can scale for the future with optimized code designed for Intel® Xeon® and compatible processors.

Profiling OpenMP on Mac

moortgatgmail_com
Hi,
Can anyone give me advice on how to profile OpenMP performance on a Mac?
I've searched all the forums, but the only suggestion I can find is to use -openmp-profile. However, when I try that, the option is reported as deprecated and overridden by -openmp, and I cannot even obtain the `guide' text file to guide me.
The issue at hand is the parallel execution of a do loop that accounts for up to 80-90% of CPU time in a fairly large finite element code. The loop iterations are independent. When I run a simulation on 4 cores, the speed-up is only about a factor of 2 relative to serial execution. When I look at the performance with Shark, I can see that a very large percentage of CPU time is spent on KMP tasks: for example, 67.9% on KMP_INVOKE_MICROTASK, about 8% on KMP_HYPER_BARRIER_RELEASE/KMP_LAUNCH_THREAD, and another 5% on KMP_X86_PAUSE and KMP_EXECUTE_TASKS.
Right now, I cannot get any information on which calls in my code are causing this kind of OpenMP overhead.
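For context, the loop in question presumably has roughly the following shape (a minimal sketch; element_work, n, and the array names are placeholders I invented for illustration, not the actual code):

```fortran
program fe_loop_sketch
  implicit none
  integer, parameter :: n = 20000      ! number of independent iterations
  real(8) :: results(n)
  integer :: i

  ! Each iteration is independent, so the loop can be parallelized with
  ! a plain PARALLEL DO; the loop index I is private by default.
  !$OMP PARALLEL DO
  do i = 1, n
     call element_work(i, results(i))  ! placeholder per-iteration subroutine
  end do
  !$OMP END PARALLEL DO

  print *, 'sum = ', sum(results)

contains

  subroutine element_work(i, res)
    integer, intent(in)  :: i
    real(8), intent(out) :: res
    res = dble(i)                      ! stand-in for the real element computation
  end subroutine element_work

end program fe_loop_sketch
```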
Any suggestions would be appreciated.
--Joachim
Below are the top results from Shark:
67.9% 67.9% libiomp5.dylib __kmp_invoke_microtask
12.6% 12.6% libSystem.B.dylib log$fenv_access_off
7.7% 7.7% libiomp5.dylib __kmp_hyper_barrier_release(barrier_type, kmp_info*, int, int, int, void*)
3.0% 3.0% libiomp5.dylib __kmp_x86_pause
2.0% 2.0% libiomp5.dylib __kmp_execute_tasks
4 Replies
TimP
Honored Contributor III
openmp-profile can be extremely useful, so I'm disappointed if the threats to eliminate it are being carried out with no clear replacement. If it's still there, don't let "deprecated" discourage you. The file it creates (guide.gvs) would give you more information on those barrier calls, e.g. how much of the time is associated with work imbalance among the threads. In order to get meaningful results, of course, you must set KMP_AFFINITY and OMP_NUM_THREADS appropriately, according to whether you have HyperThreading enabled and how many cores/threads are best for your task.
Could it be that you don't have enough work per thread task (long enough loops)?
moortgatgmail_com
Thanks for your comments.
It seems that openmp-profile has already been eliminated. When I compile and link with the flag -openmp-profile (through the makefile) and run the executable, no guide.gvs file is created. During compilation, I get a warning that -openmp is used to override -openmp-profile. (I'm using the latest composer_xe_2011_sp1.9.289.)
The loops that I'm considering for parallelization account for the bulk of the CPU time of the program. In my current problem, there are about 20,000 iterations in this loop, and each iteration calls a subroutine that takes a fraction of a second. However, individual iterations may require quite different CPU times, which is why I want better load-balancing diagnostics. Right now, I'm taking more of a trial-and-error approach with the different scheduling options (static, dynamic, or guided). Ideally, one thread should carry out multiple fast iterations while another thread works on a slower iteration, rather than having all threads wait for one chunk to finish.
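The load-balancing goal described above (fast iterations filling in around slow ones) maps directly onto SCHEDULE(DYNAMIC) with a small chunk size; a hedged sketch, where the chunk size of 50 and the function name are arbitrary illustrations to tune, not values from the thread:

```fortran
program schedule_sketch
  implicit none
  integer, parameter :: n = 20000
  real(8) :: work(n)
  integer :: i

  ! DYNAMIC hands out chunks of 50 iterations on demand, so a thread
  ! stuck on a slow iteration does not hold up the others; GUIDED is
  ! similar but starts with larger chunks and shrinks them over time.
  !$OMP PARALLEL DO SCHEDULE(DYNAMIC, 50)
  do i = 1, n
     work(i) = slow_or_fast(i)         ! iterations with uneven cost
  end do
  !$OMP END PARALLEL DO

contains

  function slow_or_fast(i) result(res)
    integer, intent(in) :: i
    real(8) :: res
    res = dble(i)                      ! placeholder for variable-cost work
  end function slow_or_fast

end program schedule_sketch
```

SCHEDULE(RUNTIME) is also worth knowing for trial-and-error: it defers the choice to the OMP_SCHEDULE environment variable, so static, dynamic, and guided can be compared without recompiling.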
I'm overriding OMP_NUM_THREADS from within the OMP directive with

!$OMP PARALLEL NUM_THREADS(NRTHREADS)

with NRTHREADS given as input to the program. I'm not familiar with the KMP_AFFINITY variable, so I should look into that.
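An alternative to putting a NUM_THREADS clause on every directive is to set the thread count once at startup through the OpenMP runtime library (a sketch; reading NRTHREADS from stdin is my assumption about how the program input arrives):

```fortran
program threads_sketch
  use omp_lib
  implicit none
  integer :: nrthreads

  ! Read the desired thread count as program input.
  read *, nrthreads

  ! omp_set_num_threads applies to all subsequent parallel regions,
  ! so individual directives no longer need a NUM_THREADS clause.
  call omp_set_num_threads(nrthreads)

  !$OMP PARALLEL
  print *, 'hello from thread', omp_get_thread_num()
  !$OMP END PARALLEL

end program threads_sketch
```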

Any other suggestions?

TimP
Honored Contributor III
On Windows and Linux, -openmp-profile is supported by running with the libiompprof5 dynamic library on the path ahead of libiomp5. If there is no libiompprof5, then it appears that this option has been removed from the installation. It doesn't necessarily make a difference whether you link with -openmp-profile or (on Linux) set LD_PRELOAD.
Without a KMP_AFFINITY setting, the Intel OpenMP runtime doesn't give you a consistent allocation of threads to logical processors. Assuming that no other task is running, KMP_AFFINITY allows you to optimize the assignment of threads to logical processors. If you have HyperThreading enabled and wish to spread your OpenMP threads one per core, you might set e.g. KMP_AFFINITY=compact,1 (use every other logical processor, starting at 0).
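One way to check what an affinity setting actually does is to run a tiny program with the verbose modifier added to KMP_AFFINITY, which makes the Intel OpenMP runtime report the thread-to-processor binding at startup; a sketch:

```fortran
! Run e.g. with: OMP_NUM_THREADS=4 KMP_AFFINITY=verbose,compact,1 ./a.out
! "verbose" makes the runtime print which logical processor (OS proc)
! each OpenMP thread is bound to, before the program output appears.
program affinity_check
  use omp_lib
  implicit none

  !$OMP PARALLEL
  print *, 'thread', omp_get_thread_num(), 'of', omp_get_num_threads()
  !$OMP END PARALLEL

end program affinity_check
```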
moortgatgmail_com
Very useful comments. I can now get a guide.gvs file, although I may have misunderstood your linking comment. Should I link both libiompprof5 and libiomp5? Right now I'm getting an error that libiompprof5 has already been initialized (which I temporarily resolved with KMP_DUPLICATE_LIB_OK=TRUE).
The results in the guide.gvs file are puzzling, though: the barriers and load balancing seem quite good, with the maximum imbalance and barrier time each less than one percent of CPU time. However, when I profile the running executable with Shark, I still see 20-25% of total CPU time used by KMP_HYPER_BARRIER_RELEASE (specifically, from KMP_LAUNCH_THREAD), and another 8.4% by KMP_X86_PAUSE (also from KMP_LAUNCH_THREAD).
I also used your suggestion KMP_AFFINITY=compact,1, but I'm surprised to see that this results in 9 threads being created, when I specify NUM_THREADS(4) in each OMP PARALLEL directive.
More generally, could you point me to a clear resource on how best to take advantage of these versatile quad-core Intel i7 processors? Specifically: when I run in serial mode on 1 core, I get a benefit from Turbo Boost (`overclocking'). When I run in parallel, can I get similar scaling up to the 8 virtual cores, or only up to the 4 physical cores? And is there an advantage to having more threads than cores? From what little I could find on this site, it appears that 4 threads for the 4 physical cores may be the optimum. I was expecting that your suggestion of KMP_AFFINITY=compact,1 would take care of this, but at 9 threads it may not.
The background is that this is a large subsurface flow simulator with runs taking several hours. Up to 90% of CPU time is spent in the loops that I parallelized, so if the load balancing etc. is decent, I hope to achieve a speed-up of close to 4 on 4 cores (or 8, if the 8 virtual cores are equivalent to the physical cores). Right now, I'm only getting a speed-up of about 2-2.5, so it seems that the OpenMP/threading overhead is significant.