I am new to VTune and I am trying to understand its output. I am running a Fortran code and get the following output from the Hotspots analysis:
Note that libiomp5.so is taking all of the time. When I run Locks and Waits I get:
If I understand this correctly, the machine is waiting for a barrier in the subroutine coord_h2b. Is this correct? However, coord_h2b is small and should not be taking any CPU time (at least for this problem). So I reran the code with the parallel directives removed from this subroutine. The CPU time did not change (as I expected), but now Locks and Waits shows this:
The amount of wait time has actually gone up, but now it is associated with another subroutine, which also should not be important. Can anyone give me insight into what is going on?
I should note that I get a warning when I run VTune that "Symbol file is not found." I compile the code with
ifort -g -openmp -w -recursive -pc 64
but I link with a couple of libraries that are not compiled with the -g option (although these do not take up any CPU time).
First of all, I just want to verify that you are using the latest VTune Amplifier XE 2011 Update 2.
I don't know what your example ifort code is. I suspect there is no more work in your parallel code region, so most of the CPU state dropped into "Wait" (idle) - see the function "OMP Join Barrier..." called by [libiomp5.so].
Please use the ifort example code from the Composer XE product - /opt/intel/composerxe-2011.0.084/Samples/en_US/Fortran/openmp_samples
[root@NHM02 peter]# ifort -g -fpp -openmp -openmp-report openmp_sample.f90 -o openmp_sample.ifort
openmp_sample.f90(82) (col. 7): remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
openmp_sample.f90(73) (col. 7): remark: OpenMP DEFINED REGION WAS PARALLELIZED.
Using Concurrency analysis, it seems all OMP worker threads display the hot function "ompprime" correctly. This case keeps the CPUs busy and you can see that the "Wait" time is short.
Please let me know if you have other questions. Thanks!
It looks from your first Locks&Waits run that coord_vb2h was just waiting in the wings to take first chair in the Locks&Waits race, so no surprise that it should jump to first. I also note that most of the recorded time was marked as idle, and it's in a join barrier, meaning most of the HW threads are probably sitting idle while one thread finishes some work - suspiciously like a load imbalance. Also in play is the critical section in symba5_step_pl, which jumps up a bit when coord_h2b is taken out of the picture. (Maybe the source of the imbalance?) There's also a hint that symba5 is both parallelized through some OMP construct and recursive. That combination could spell danger.
I think about now the most useful thing will be to reveal a little source code, if you have the freedom to do so. How are these "innocuous" subroutines associated with the symba5 code and its critical section?
Thanks for the reply. It seems to me that it is not a simple load balancing issue because the clock time actually increases as I increase the number of threads. If I isolate the part of the code I am having problems with, I get:
N threads Clock time (s)
Could this be the result of load imbalance?
There is one thing that I forgot to say before. While coord_vb2h and the like have OMP directives in them to parallelize some of the loops, they are not the outermost parallelized loop. It is not my intention to run these in parallel, so I set OMP_NESTED=false. Shouldn't this stop the OMP directives in these routines from taking effect? I can't just remove the directives from these routines because I need them in other parts of the code.
It is true that some of the code is recursive, but the recursive call is well downstream from the outermost parallelized loop. So each thread should be doing this independently. Is that still a problem?
I am happy to supply the source code, but it is about 20,000 lines. Is there some other way (like constructing a flow chart) to give you what you need?
If nested evaluates to 0, nested parallelism is disabled (the default), and nested parallel regions are serialized and executed by the current thread. If nested evaluates to a nonzero value, nested parallelism is enabled, and nested parallel regions may deploy additional threads to form new teams. This call takes precedence over the OMP_NESTED environment variable.
I'm not the expert on this; you may raise this problem on the Intel Fortran Compiler forum. I also suggest reading http://software.intel.com/en-us/forums/showthread.php?t=70018
I am using the subroutine calls. In particular, at the very beginning of the code, I have:
c... OMP stuff
!$ logical OMP_get_dynamic,OMP_get_nested
!$ integer nthreads,OMP_get_max_threads
!$ write(*,'(a)') ' OpenMP parameters:'
!$ write(*,'(a)') ' ------------------'
!$ write(*,*) ' Dynamic thread allocation = ',OMP_get_dynamic()
!$ call OMP_set_nested(.false.)
!$ write(*,*) ' Nested parallel loops = ',OMP_get_nested()
!$ nthreads = OMP_get_max_threads() ! In the *parallel* case
!$ write(*,'(a,i3,/)') ' Number of threads = ', nthreads
The output looks like:
Dynamic thread allocation = F
Nested parallel loops = F
Number of threads = 8
Thanks for any insight that you can give.
So, what is the configuration of the machine you ran these numbers upon? 2-socket/6-core? 2-socket/3-core with Hyper-Threading technology? Yes, setting OMP_NESTED false should prevent overcommitment of the thread teams, but I'm still concerned about that critical section that is making noise in symba5_step_pl. You say the recursion is lower down and restricted to individual threads calling themselves (purely, or under the aegis of OMP_NESTED==false?)--is there any chance that the critical section is entered within the range of that recursion? That might lock up a thread.
The numbers you cite above seem consistent with a number of typical bottleneck scenarios, usually moderated by memory access (performance improves until limited by contention between the HW threads, at which point more threads add more contention and more overhead). I'm in high speculation mode right now, but if one of the threads in the team got hung up in some recursive delays due to resource conflicts, possibly a critical section, it could cause symptoms similar to this as other members of the thread team spin at one of the join points waiting for the prodigal thread. Like I said, just a guess but it seems mostly consistent with the facts you've shared. Even if it is a bad guess, it might provide some insights into the problem you do face.
If the OMP_NESTED stuff is working correctly, then I am sure that the recursion is well downstream from the place in the code where the parallelized do loop is. The symba5_step_pl routine is not in the part of the code that is giving me problems. That part actually behaves quite nicely.
BTW: I have taken your advice and started a thread on the fortran forum (http://software.intel.com/en-us/forums/showthread.php?t=81765&p=1#145830).
Thanks for your insight so far; I would be interested in any other comments you may have.
The curve your timing data describe is the classic shape of parallel resource contention. Most of the big times in the VTune Amplifier screen shots you've shared are spin-waits in OMP join code - these are the rendezvous points at the end of parallel do-loops where the workers in the thread team wait for their peers to finish the work of the do-loop. By the looks of the graphs you've shown, most of the wait time is occupied spinning in these joins.
Here's a wild idea that I've never tried before. Why don't you add a nowait directive to all your parallel do-loops as a diagnostic test? It may cause your program to crash. It will most likely cause your program to compute bad results. But it may also cause those spin-waits to go away by letting team threads proceed as soon as they are done. If that all works and doesn't crash, you might be able to find out which of your loops are the big waiters by selectively applying nowait and see what effect it has on locks&waits analysis.
So, for each subroutine in my code that contains parallel code I had to do the following:
[bash]      subroutine symba5_kick(nbod,mass,irec,iecnt,ielev, ...

c... Inputs Only:
      ...
c... Inputs and Outputs:
      ...
!$    logical OMP_in_parallel

c... Executable code
!$    if (omp_in_parallel()) then
!$       call symba5_kick_P(nbod,mass,irec,iecnt,ielev, ...
!$    else
         call symba5_kick_S(nbod,mass,irec,iecnt,ielev, ...
!$    endif

      end   ! symba5_kick.f [/bash]
where symba5_kick_S and symba5_kick_P are serial and parallel versions of the code, respectively. It is a pain, but it appears to solve my speed problem.
Thanks for all your help!
Also, is it the case that sometimes you want functions like symba5_kick to run in parallel and other times you want them to run as single-threaded support subroutines for some parallel caller? I know that now that you've sorted out the bottleneck in your code you're probably eager to continue developing it, but I'm not sure yet that I understand the particulars of your code. Yes, from what I understand of the problem, I would expect that a call from an OMP parallel loop that fell into another OMP parallel loop with OMP_NESTED=false asserted should not spin at the inner loop join, but I'm not yet sure I could recreate the conditions that you encountered in your code (I'm still in the dark on the fine structure of its design). If you could take the time to assemble some means for me to reproduce the problem (anything from a simple description of the code that shows the function hierarchy, OMP parallel placements, and recursive components, all the way up to actual code that demonstrates the same problem), that would be greatly appreciated.
I will try to put something together. The simple test that I tried to put together behaved correctly, and did not do what I thought it would. In particular:
[bash]      real a(4)
c... OMP stuff
!$    logical OMP_get_dynamic,OMP_get_nested
!$    integer nthreads,OMP_get_max_threads
c... OMP stuff
!$    write(*,'(a)') ' OpenMP parameters:'
!$    write(*,'(a)') ' ------------------'
!$    write(*,*) ' Dynamic thread allocation = ',OMP_get_dynamic()
!$    call OMP_set_nested(.false.)
!$    write(*,*) ' Nested parallel loops = ',OMP_get_nested()
!$    nthreads = OMP_get_max_threads() ! In the *parallel* case
!$    write(*,'(a,i3,/)') ' Number of threads = ', nthreads
!$omp parallel do default(none) shared(a)
      do i=1,4
         call sub(i,a)
      enddo
!$omp end parallel do
      stop
      end
c---------------------
      subroutine sub(i,a)
      real a(4)
      integer omp_get_thread_num
      write(*,*) 'start ', i
!$omp parallel do shared(a) private(j)
      do j=1,4
         a(j) = 1.0
         if( (i.eq.1).and.(j.eq.1) ) then
            do while(.true.)
               a(j) = 1.0
            enddo
         endif
      end do
!$omp end parallel do
      write(*,*) 'mid ', i
!$omp parallel do shared(a) private(j)
      do j=1,4
         a(j) = 1.0
      end do
!$omp end parallel do
      write(*,*) 'end ', i
      return
      end [/bash]
The output was:
Dynamic thread allocation = F
Nested parallel loops = F
Number of threads = 4
If my hypothesis had been correct, none of the threads would have gotten to 'end', but three of them did. Let me play around a bit more.
Did you have any time to waste on my nowait idea? Bug or design blockage, it might produce some interesting new symptoms, which might provide a clue. Fun stuff. Maybe not for you, but I'm having fun trying to weave through the maze.