
libiomp5.so taking all the time - VTune Amplifier XE 2011 and ifort

hallevison
Beginner
Hi:

I am new to VTune and I am trying to understand its output. I am running a Fortran code and get the following output from the Hotspots analysis:



Note that libiomp5.so is taking all of the time. When I run Locks and Waits I get:



If I understand this correctly, the machine is waiting for a barrier in the subroutine coord_h2b. Is this correct? However, coord_h2b is small and should not be taking any CPU time (at least for this problem). So, I reran the code with the parallel directives removed from this subroutine. The CPU time did not change (as I expected), but now the Locks and Waits analysis shows this:




The amount of wait time has actually gone up, but now it is associated with another subroutine, which also should not be important. Can anyone give me insight into what is going on?

I should note that I get a warning when I run VTune that "Symbol file is not found." I compile the code with
ifort -g -openmp -w -recursive -pc 64, but I link with a couple of libraries that are not compiled with the -g option (although these do not take up any CPU time).

Thanks.

- Hal


12 Replies
Peter_W_Intel
Employee

Hi Hal,

First of all, I just want to verify that you are using the latest VTune Amplifier XE 2011 Update 2.

I don't know what your ifort code looks like. I suspect there is not enough work in your parallel region, so most of the CPU state drops into "Wait" (idle) - see the function "OMP Join Barrier..." called from [libiomp5.so].
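
To illustrate what I mean, here is a small made-up example (it is not your code; the program name and loop counts are invented). One iteration carries almost all of the work, so the other threads finish immediately and spin in the OMP join barrier, and that spin time is charged to libiomp5.so.

[bash]      program imbalance
      implicit none
      integer i, j
      real*8 s
      s = 0.0d0
c...  One iteration does almost all of the work; the other
c...  threads finish at once and spin in the OMP join barrier.
!$omp parallel do private(j) reduction(+:s)
      do i = 1, 8
         if (i .eq. 1) then
            do j = 1, 100000000
               s = s + sin(dble(j))
            enddo
         endif
      enddo
!$omp end parallel do
      write(*,*) s
      end
[/bash]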

Please try the ifort example code from the Composer XE product - /opt/intel/composerxe-2011.0.084/Samples/en_US/Fortran/openmp_samples:
[root@NHM02 peter]# ifort -g -fpp -openmp -openmp-report openmp_sample.f90 -o openmp_sample.ifort
openmp_sample.f90(82) (col. 7): remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
openmp_sample.f90(73) (col. 7): remark: OpenMP DEFINED REGION WAS PARALLELIZED.


Using the Concurrency analysis, it seems all OMP worker threads display the hot function "ompprime" correctly. This case keeps the CPUs busy and you can see that the "Wait" time is short.

Please let me know if you have other questions. Thanks!

Regards, Peter

robert-reed
Valued Contributor II

It looks from your first Locks & Waits run that coord_vb2h was just waiting in the wings to take first chair in the Locks & Waits race, so it is no surprise that it jumps to first. I also note that most of the recorded time was marked as idle, and it is in a join barrier, meaning most of the HW threads are probably sitting idle while one thread finishes some work - which looks suspiciously like a load imbalance. Also in play is the critical section in symba5_step_pl, which jumps up a bit when coord_h2b is taken out of the picture. (Maybe the source of the imbalance?) There is also a hint that symba5 is both parallelized through some OMP construct and recursive. That combination could spell danger.

I think about now the most useful thing will be to reveal a little source code, if you have the freedom to do so. How are these "innocuous" subroutines associated with the symba5 code and its critical section?

hallevison
Beginner
Hi:

Thanks for the reply. It seems to me that it is not a simple load balancing issue because the clock time actually increases as I increase the number of threads. If I isolate the part of the code I am having problems with, I get:

N threads   Clock time (s)
    1           136.
    2            90.
    4            65.
    6           113.
    8           165.
   10           202.
   12           237.

Could this be the result of load imbalance?
There is one thing that I forgot to say before. While coord_vb2h and the like have OMP directives in them to parallelize some of their loops, they do not contain the outermost parallelized loop. It is not my intention to run these in parallel, so I set OMP_NESTED=false. Shouldn't this stop the OMP directives in these routines from taking effect? I can't just remove the directives from these routines because I need them in other parts of the code.

It is true that some of the code is recursive, but the recursive call is well downstream from the outermost parallelized loop. So, each thread should be doing this independently. Is that still a problem?

I am happy to supply the source code, but it is about 20,000 lines. Is there some other way (like constructing a flow chart) to give you what you need?

Thanks again

-Hal


Peter_W_Intel
Employee
Is it possible to use the omp_set_nested() runtime call instead of the environment variable?

If nested evaluates to 0, nested parallelism is disabled, which is the default, and nested parallel regions are serialized and executed by the current thread. If nested evaluates to a nonzero value, nested parallelism is enabled, and nested parallel regions may deploy additional threads to form the nested teams. This call takes precedence over the OMP_NESTED environment variable.

I'm not the expert on this; you may want to raise this problem on the Intel Fortran Compiler forum. I also suggest reading http://software.intel.com/en-us/forums/showthread.php?t=70018

Regards, Peter

hallevison
Beginner
Hi:

I am using the subroutine calls. In particular, at the very beginning of the code, I have:

c...  OMP stuff
!$    write(*,'(a)')      ' OpenMP parameters:'
!$    write(*,'(a)')      ' ------------------'
!$    write(*,*) '   Dynamic thread allocation = ',OMP_get_dynamic()
!$    call OMP_set_nested(.false.)
!$    write(*,*) '   Nested parallel loops = ',OMP_get_nested()
!$    nthreads = OMP_get_max_threads() ! In the *parallel* case
!$    write(*,'(a,i3,/)') '   Number of threads  = ', nthreads

The output looks like:


OpenMP parameters:
------------------
Dynamic thread allocation = F
Nested parallel loops = F
Number of threads = 8


Thanks for any insight that you can give.

-Hal
robert-reed
Valued Contributor II

So, what is the configuration of the machine you ran these numbers upon? 2-socket/6-core? 2-socket/3-core with Hyper-Threading technology? Yes, setting OMP_NESTED false should prevent overcommitment of the thread teams, but I'm still concerned about that critical section that is making noise in symba5_step_pl. You say the recursion is lower down and restricted to individual threads calling themselves (purely, or under the aegis of OMP_NESTED==false?)--is there any chance that the critical section is entered within the range of that recursion? That might lock up a thread.

The numbers you cite above seem consistent with a number of typical bottleneck scenarios, usually moderated by memory access (performance improves until limited by contention between the HW threads, at which point more threads add more contention and more overhead). I'm in high speculation mode right now, but if one of the threads in the team got hung up in some recursive delays due to resource conflicts, possibly a critical section, it could cause symptoms similar to this as other members of the thread team spin at one of the join points waiting for the prodigal thread. Like I said, just a guess but it seems mostly consistent with the facts you've shared. Even if it is a bad guess, it might provide some insights into the problem you do face.
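
To make the pattern I am worried about concrete, here is a minimal made-up sketch (the routine and variable names are invented, not taken from your code): a recursive routine, called from inside a parallel do, that funnels every level of the recursion through one critical section. The thread that recurses deepest holds the others up, and the rest of the team shows up as spin time in the join barrier.

[bash]      recursive subroutine walk(level, total)
      implicit none
      integer level, k
      real*8 total
c...  Every level of the recursion funnels through the same named
c...  critical section, so only one thread makes progress here.
!$omp critical (walk_lock)
      do k = 1, 10000
         total = total + sin(dble(k))
      enddo
!$omp end critical (walk_lock)
      if (level .gt. 1) call walk(level-1, total)
      end

      program demo
      implicit none
      integer i
      real*8 total
      total = 0.0d0
!$omp parallel do shared(total)
      do i = 1, 8
c...     The thread that draws i=1 recurses much deeper than the
c...     rest; the others finish quickly and spin in the join.
         if (i .eq. 1) then
            call walk(5000, total)
         else
            call walk(5, total)
         endif
      enddo
!$omp end parallel do
      write(*,*) total
      end
[/bash]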

hallevison
Beginner
I am using a machine with two six-core AMD Opteron 2439 SE processors, which do not have Hyper-Threading technology.

If the OMP_NESTED stuff is working correctly, then I am sure that the recursion is well downstream from the place in the code where the parallelized do loop is. The symba5_step_pl routine is not in the part of the code that is giving me problems. That part actually behaves quite nicely.

BTW: I have taken your advice and started a thread on the Fortran forum (http://software.intel.com/en-us/forums/showthread.php?t=81765&p=1#145830).

Thanks for your insight so far; I would be interested in any other comments you may have.

-Hal.


robert-reed
Valued Contributor II
When you say "downstream" from the parallelized do-loop, do you mean buried way down in the call hierarchy within the scope of the parallel loop, or do you mean that the recursive code is outside the scope of the parallel do-loop? If you mean after (i.e., outside the scope of) the parallel do-loop, then I would agree with you. If, however, the recursion is inside the code using critical sections, you might have a problem there, and to help I will have to have a better understanding of those interactions. I might not need your 20,000-line source, but I've taken this about as far as I can without more details: maybe a block diagram showing the arrangement of the parallel loops, critical sections, and recursive code, with relevant code snippets.

The curve your timing data describe has the classic shape of parallel resource contention. Most of the big times in the VTune Amplifier screenshots you've shared are spin-waits in OMP join code - these are the rendezvous points at the end of parallel do-loops where the workers in the thread team wait for their peers to finish the work of the do-loop. By the looks of the graphs you've shown, most of the wait time is spent spinning in these joins.

Here's a wild idea that I've never tried before. Why don't you add a nowait directive to all your parallel do-loops as a diagnostic test? It may cause your program to crash. It will most likely cause your program to compute bad results. But it may also cause those spin-waits to go away by letting team threads proceed as soon as they are done. If that all works and doesn't crash, you might be able to find out which of your loops are the big waiters by selectively applying nowait and seeing what effect it has on the Locks & Waits analysis.
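
To be concrete about the mechanics (this is an invented loop, not taken from your code): nowait is not accepted on the combined parallel do directive, so each loop has to be split into a do worksharing construct inside a plain parallel region, with the nowait clause on the end do. Something like:

[bash]      program nowait_test
      implicit none
      integer n
      parameter (n = 1000)
      integer i
      real*8 a(n), b(n)
c...  The combined "parallel do" cannot take nowait, so each loop
c...  becomes a worksharing do inside one parallel region, and
c...  "end do nowait" drops the barrier at the join so threads
c...  that finish early move on instead of spinning.
!$omp parallel shared(a,b) private(i)
!$omp do
      do i = 1, n
         a(i) = dble(i)
      enddo
!$omp end do nowait
c...  Safe here only because this loop does not read a(:); with a
c...  real dependence, nowait can give bad results or crashes.
!$omp do
      do i = 1, n
         b(i) = 2.0d0*dble(i)
      enddo
!$omp end do nowait
!$omp end parallel
      write(*,*) a(n), b(n)
      end
[/bash]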
hallevison
Beginner
Well, the solution has come from the other thread I referenced above (http://software.intel.com/en-us/forums/showthread.php?t=81765&p=1#145921). It looks to me that even when OMP_NESTED=false, the compiler was putting barriers in the nested loops where they are not needed. This seems like a bug to me.
So, for each subroutine in my code that contains parallel code, I had to do the following:
[bash]      subroutine symba5_kick(nbod,mass,irec,iecnt,ielev,
     &     rhill,xh,yh,zh,vxb,vyb,vzb,dt,sgn,ielc,ielst)

      include '../swift.inc'
      include 'symba5.inc'

c...  Inputs Only:
      integer nbod,irec
      real*8 mass(nbod),dt,rhill(nbod),sgn
      integer*2 iecnt(NTPMAX),ielev(nbod)
      real*8 xh(nbod),yh(nbod),zh(nbod)
      integer*2 ielst(2,NENMAX),ielc

c...  Inputs and Outputs:
      real*8 vxb(nbod),vyb(nbod),vzb(nbod)

c...  Internals:
!$    logical OMP_in_parallel

c-----
c...  Executable code

c...  Dispatch: inside a parallel region (or without OpenMP) call
c...  the serial version; otherwise call the parallel version.
!$    if (omp_in_parallel()) then
         call symba5_kick_S(nbod,mass,irec,iecnt,ielev,
     &        rhill,xh,yh,zh,vxb,vyb,vzb,dt,sgn,ielc,ielst)
!$    else
!$       call symba5_kick_P(nbod,mass,irec,iecnt,ielev,
!$   &        rhill,xh,yh,zh,vxb,vyb,vzb,dt,sgn,ielc,ielst)
!$    endif

      return
      end                       ! symba5_kick.f
c--------------------------------------------------------------
[/bash]

where symba5_kick_S and symba5_kick_P are serial and parallel versions of the code, respectively. It is a pain, but it appears to solve my speed problem.

Thanks for all your help!
robert-reed
Valued Contributor II
Great! I'm happy to hear that you found a way (however gross ;-) to bypass the spin waits that were idling your threads, but in pursuit of the bug you suggest might exist, I'd like to pursue this a little further. Did you perchance try the nowait hack I suggested, and did that have any effect?

Also, is it the case that sometimes you want functions like symba5_kick to run in parallel and other times you want them to run as a single-threaded support subroutine for some parallel caller? I know that now that you've sorted out the bottleneck in your code, you're probably hot to continue developing it, but I'm not sure yet that I understand the particulars of your code. Yes, from what I understand of the problem, I would expect that a call from an OMP parallel loop that fell into another OMP parallel loop with OMP_NESTED=false asserted should not spin at the inner loop join, but I'm not yet sure I could recreate the conditions that you encountered in your code (I'm still in the dark on the fine structure of its design). If you could take the time to assemble some means for me to reproduce the problem (anything from a simple description of the code that shows the function hierarchy, OMP parallel placements, and recursive components, all the way up to actual code that demonstrates the same problem), that would be greatly appreciated.
hallevison
Beginner
Hi Robert:

I will try to put something together. However, the simple test that I wrote behaved correctly, and did not do what I thought it would. In particular:

[bash]      real a(4)

c...  OMP stuff
!$	logical OMP_get_dynamic,OMP_get_nested
!$	integer nthreads,OMP_get_max_threads


c...  OMP stuff
!$    write(*,'(a)')      ' OpenMP parameters:'
!$    write(*,'(a)')      ' ------------------'
!$    write(*,*) '   Dynamic thread allocation = ',OMP_get_dynamic()
!$    call OMP_set_nested(.false.)
!$    write(*,*) '   Nested parallel loops = ',OMP_get_nested()
!$    nthreads = OMP_get_max_threads() ! In the *parallel* case
!$    write(*,'(a,i3,/)') '   Number of threads  = ', nthreads 


!$omp parallel do default(none) shared(a)
      do i=1,4
         call sub(i,a)
      enddo
!$omp end parallel do  

      stop
      end

c---------------------
      subroutine sub(i,a)  
      real a(4)
      integer omp_get_thread_num

      write(*,*) 'start ', i

!$omp parallel do shared(a) private(j)  
      do j=1,4  
         a(j) = 1.0  
         if( (i.eq.1).and.(j.eq.1) ) then
            do while(.true.)
               a(j) = 1.0  
            enddo
         endif
      end do  
!$omp end parallel do  

      write(*,*) 'mid  ', i

!$omp parallel do shared(a) private(j)  
      do j=1,4  
         a(j) = 1.0  
      end do  
!$omp end parallel do  

      write(*,*) 'end  ', i

      return
      end
[/bash]

The output was:
OpenMP parameters:
------------------
Dynamic thread allocation = F
Nested parallel loops = F
Number of threads = 4

start 1
start 2
mid 2
end 2
start 4
mid 4
end 4
start 3
mid 3
end 3

If my hypothesis had been correct, none of the threads would have gotten to 'end', but three of them did. Let me play around a bit more.
robert-reed
Valued Contributor II
This is actually good news. I would expect any simple replication of what you think the problem is to be a test case that we would have already caught in our prerelease testing. Therefore, I think you may have captured either some particular bug or livelock in your own code, or some subtle RTL bug that, because of its complex nature, eludes our internal testing. I am particularly interested in that latter case. If there is some condition where OMP_NESTED=false but inner parallel DO-loops aren't automatically doing a nowait - there should be nothing to wait for - then there is a bug.

Did you have any time to waste on my nowait idea? Bug or design blockage, it might produce some interesting new symptoms, which might provide a clue. Fun stuff. Maybe not for you, but I'm having fun trying to weave through the maze.