First of all, I just want to verify: are you using the latest VTune Amplifier XE 2011 Update 2?
I don't know what your example ifort code is. I suspect there is not much work in your parallel code region, so most of the CPU state dropped into "Wait" (idle) - see the function "OMP Join Barrier..." called from [libiomp5.so].
Please use the ifort example code shipped with the Composer XE product: /opt/intel/composerxe-2011.0.084/Samples/en_US/Fortran/openmp_samples
[root@NHM02 peter]# ifort -g -fpp -openmp -openmp-report openmp_sample.f90 -o openmp_sample.ifort
openmp_sample.f90(82) (col. 7): remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
openmp_sample.f90(73) (col. 7): remark: OpenMP DEFINED REGION WAS PARALLELIZED.
Using Concurrency Analysis, it seems all OMP workers display the hot function "ompprime" correctly. This case keeps the CPUs busy, and you can see the "Wait" time is short.
Please let me know if you have other questions. Thanks!
It looks from your first Locks&Waits run that coord_vb2h was just waiting in the wings to take first chair in the Locks&Waits race; no surprise it should jump to first. I also note that most of the recorded time was marked as idle, and it's in a Join barrier, meaning most of the HW threads are probably sitting idle while one thread finishes some work - suspiciously like a load imbalance. Also in play is the critical section in symba5_step_pl, which jumps up a bit when coord_h2b is taken out of the picture. (Maybe the source of the imbalance?) There's also a hint that symba5 is both parallelized through some OMP construct and recursive. That combination could spell danger.
I think about now the most useful thing will be to reveal a little source code, if you have the freedom to do so. How are these "innocuous" subroutines associated with the symba5 code and its critical section?
If nested evaluates to 0, nested parallelism is disabled (this is the default), and nested parallel regions are serialized and executed by the current thread. If nested evaluates to a nonzero value, nested parallelism is enabled, and parallel regions that are nested may deploy additional threads to form nested teams. This call takes precedence over the OMP_NESTED environment variable.
I'm not an expert on this; you may want to raise the problem on the Intel Fortran Compiler forum. I also suggest reading http://software.intel.com/en-us/forums/showthread.php?t=70018
So, what is the configuration of the machine you ran these numbers upon? 2-socket/6-core? 2-socket/3-core with Hyper-Threading technology? Yes, setting OMP_NESTED false should prevent overcommitment of the thread teams, but I'm still concerned about that critical section that is making noise in symba5_step_pl. You say the recursion is lower down and restricted to individual threads calling themselves (purely, or under the aegis of OMP_NESTED==false?)--is there any chance that the critical section is entered within the range of that recursion? That might lock up a thread.
The numbers you cite above seem consistent with a number of typical bottleneck scenarios, usually moderated by memory access (performance improves until limited by contention between the HW threads, at which point more threads add more contention and more overhead). I'm in high speculation mode right now, but if one of the threads in the team got hung up in some recursive delays due to resource conflicts, possibly a critical section, it could cause symptoms similar to this as other members of the thread team spin at one of the join points waiting for the prodigal thread. Like I said, just a guess but it seems mostly consistent with the facts you've shared. Even if it is a bad guess, it might provide some insights into the problem you do face.
[bash]
      subroutine symba5_kick(nbod,mass,irec,iecnt,ielev,
c...  Inputs Only:
c...  Inputs and Outputs:
!$    logical OMP_in_parallel
c...  Executable code
!$    if (omp_in_parallel()) then
!$       call symba5_kick_P(nbod,mass,irec,iecnt,ielev,
      end   ! symba5_kick.f
[/bash]
[bash]
      real a(4)
c... OMP stuff
!$    logical OMP_get_dynamic,OMP_get_nested
!$    integer nthreads,OMP_get_max_threads
c... OMP stuff
!$    write(*,'(a)') ' OpenMP parameters:'
!$    write(*,'(a)') ' ------------------'
!$    write(*,*) ' Dynamic thread allocation = ',OMP_get_dynamic()
!$    call OMP_set_nested(.false.)
!$    write(*,*) ' Nested parallel loops = ',OMP_get_nested()
!$    nthreads = OMP_get_max_threads()   ! In the *parallel* case
!$    write(*,'(a,i3,/)') ' Number of threads = ', nthreads

!$omp parallel do default(none) shared(a)
      do i=1,4
         call sub(i,a)
      enddo
!$omp end parallel do
      stop
      end
c---------------------
      subroutine sub(i,a)
      real a(4)
      integer omp_get_thread_num
      write(*,*) 'start ', i
!$omp parallel do shared(a) private(j)
      do j=1,4
         a(j) = 1.0
         if( (i.eq.1).and.(j.eq.1) ) then
            do while(.true.)
               a(j) = 1.0
            enddo
         endif
      end do
!$omp end parallel do
      write(*,*) 'mid ', i
!$omp parallel do shared(a) private(j)
      do j=1,4
         a(j) = 1.0
      end do
!$omp end parallel do
      write(*,*) 'end ', i
      return
      end
[/bash]