Intel® Fortran Compiler
Build applications that can scale for the future with optimized code designed for Intel® Xeon® and compatible processors.

executable hanging in OpenMP region

Michael_Barkhudarov

We updated from version 13.1 of the compiler to 15.0.2.179 for a large CFD FORTRAN code and started seeing cases of the solver hanging: sitting in memory with 0% CPU used. Here are the observations so far:

1. It happens only on Windows.

2. Running on a single thread works fine.

3. The executable runs for a while before it hangs. Restarting the solver from an intermediate time runs fine.

4. For the cases we have so far, it hangs in different parts of the code, but for a given case it always hangs at the same location.

Here is a code snippet where it happens in one of the cases. This is a sub-section of a large subroutine. This code is invoked several million times before the solver hangs:

c Here we are adding a column to the Hessenberg matrix (hmatrix)
        do kk=1,igfy
          hmatrix(kk,igfy)=zero
          do n=1,numthrds
            flp_lcl(1,n)=hmatrix(kk,igfy)
          enddo
c
          do nbl=1,nblcks
! divide the loop between threads
            call load_bal(0,ijklim(nbl,1),nijkpr(nbl)) ! NBL is a global variable, comes from a module
!$omp parallel do schedule(static,1)
!$omp& private(n,nn,ijk)
            do n=1,numthrds ! loop over threads
              do nn=klo(n),khi(n)
                ijk=ijkpr(nn)
                flp_lcl(1,n)=flp_lcl(1,n)+vvect(ijk,igfyp1)*vvect(ijk,kk)
              enddo

! PUTTING A PRINT STATEMENT HERE PRINTS ALL THREADS.
            enddo ! THIS IS THE LINE WHERE IT HANGS.
          enddo

Could this have been resolved in version 15.0.4 or 16.0 of the compiler?

Thank you for any help you can provide.

Michael

John_Campbell
New Contributor II

Would it help to declare flp_lcl as shared?

I would use:

!$omp parallel do                                                       &
!$omp& schedule (static,1)                                              &
!$omp& shared   (kk, numthrds, klo, khi, ijkpr, flp_lcl, vvect, igfyp1) &
!$omp& private  (n, nn, ijk)
             do n = 1,numthrds ! loop over threads
               do nn = klo(n),khi(n)
                 ijk = ijkpr(nn)
                 flp_lcl(1,n) = flp_lcl(1,n) + vvect(ijk,igfyp1) * vvect(ijk,kk)
               end do
! PUTTING A PRINT STATEMENT HERE PRINTS ALL THREADS.
             end do ! THIS IS THE LINE WHERE IT HANGS.  
!$omp end parallel do

John

(Your example mixes fixed and free format?)

Michael_Barkhudarov

Hi John,

Thanks for the quick response.

Isn't FLP_LCL shared by default? I failed to mention that we use the approach of arrays of size (1:numthrds) to do the reductions, instead of using OpenMP constructs for that. We've been doing this for years, so I have faith in it :-). The first index is there to space the 'n' accesses by at least a cache line, to reduce unnecessary cache updates.

I am now trying explicit 'shared' clauses. Also, I am replacing KK in that loop, which is the iteration variable of an outer DO loop enclosing the OpenMP loop, with a local variable used just for the OpenMP loop. Maybe that will make it easier for the compiler to process.

You are right about the mix of fixed and free format. The original code actually uses fixed format; I just modified it here to be more compact. Sorry about that.

Michael

 

Michael_Barkhudarov

Hi again, John.

I tried your suggestion, with the same result: it hangs at the same location. I am moving to version 15.0.4 of the compiler now.

Michael

jimdempseyatthecove
Honored Contributor III

Try a variation to see what is happening:

!$omp parallel                                                        &
!$omp& shared   (kk, numthrds, klo, khi, ijkpr, flp_lcl, vvect, igfyp1) &
!$omp& private  (n, nn, ijk)
print *,omp_get_thread_num()
!$omp do schedule (static,1)
             do n = 1,numthrds ! loop over threads
               do nn = klo(n),khi(n)
                 ijk = ijkpr(nn)
                 flp_lcl(1,n) = flp_lcl(1,n) + vvect(ijk,igfyp1) * vvect(ijk,kk)
               end do
             end do ! THIS IS THE LINE WHERE IT HANGS.  
!$omp end do
print *,omp_get_thread_num()
!$omp end parallel

Jim Dempsey

jimdempseyatthecove
Honored Contributor III

You might also want to verify numthrds is what you expect.

You should see each thread number twice. If one of them does not print twice, verify that the do nn= loop has valid ranges. A very large index range without subscript checking could walk over non-mapped virtual memory (and may take a long time to error out).

Jim Dempsey

Michael_Barkhudarov

Will give it a shot, Jim, thank you.

Michael_Barkhudarov

Per Jim's suggestion, I added the print statements to the DO loop (the variable cycle is used to reduce the amount of output). It prints the thread id, then the loop iteration, then a tag: 'a' at the beginning of the loop, 'b' at the end.

!$omp parallel do schedule(static,1)
!$omp& private(n,nn,ijk)
!$omp& shared(kk_loc,igfyp1,klo,khi,ijkpr,dmx_lcl,vvect,numthrds)
            do n=1,numthrds   ! MRB
              if(cycle.gt.370000) print *,omp_get_thread_num(),n,'a'
              do nn=klo(n),khi(n)
                ijk=ijkpr(nn)
                dmx_lcl(1,n)=dmx_lcl(1,n)
     &                    +vvect(ijk,igfyp1)*vvect(ijk,kk_loc)
              enddo
               if(cycle.gt.370000) print *,omp_get_thread_num(),n,'b'
            enddo
          enddo

The result for the last three times the loop is executed is:

           0           1 a
           1           2 a
           2           3 a
           3           4 a
           0           1 b
           1           2 b
           2           3 b
           3           4 b

           0           1 a
           3           4 a
           2           3 a
           1           2 a
           0           1 b
           3           4 b
           2           3 b
           1           2 b

           3           4 a
           2           3 a
           0           1 a
           1           2 a
           3           4 b
           2           3 b
           0           1 b
           1           2 b

After the last line it hangs. I also tried the 15.0.4 and 13.1 compilers, with the same result. There's got to be something wrong with this code, but I can't figure it out!

jimdempseyatthecove
Honored Contributor III

While you still have some hair left, try inserting some asserts:

if(n > ubound(klo,1)) print *, "n > ubound(klo)"
if(n > ubound(khi,1)) print *, "n > ubound(khi)"
do nn=klo(n),khi(n)
  if(nn < lbound(ijkpr,1)) print *, "nn < lbound(ijkpr)"
  if(nn > ubound(ijkpr,1)) print *, "nn > ubound(ijkpr)"
  ijk=ijkpr(nn)
  if(n > ubound(dmx_lcl,dim=2)) print *, "n > ubound(dmx_lcl,dim=2)"
  if(ijk < lbound(vvect,dim=1)) print *, "ijk < lbound(vvect,dim=1)"
  if(ijk > ubound(vvect,dim=1)) print *, "ijk > ubound(vvect,dim=1)"
  if(igfyp1 < lbound(vvect,dim=2)) print *, "igfyp1 < lbound(vvect,dim=2)"
  if(igfyp1 > ubound(vvect,dim=2)) print *, "igfyp1 > ubound(vvect,dim=2)"
  if(kk_loc < lbound(vvect,dim=2)) print *, "kk_loc < lbound(vvect,dim=2)"
  if(kk_loc > ubound(vvect,dim=2)) print *, "kk_loc > ubound(vvect,dim=2)"
  dmx_lcl(1,n)=dmx_lcl(1,n)+vvect(ijk,igfyp1)*vvect(ijk,kk_loc)
enddo

Bounds checking should have caught this.

Note, if this is a Windows app (IOW not a console app), then the console error dump window may appear underneath the Window of the application *** or be a NULL bit bucket ***.

Jim Dempsey

jimdempseyatthecove
Honored Contributor III

By the way, I usually have a subroutine named DOSTOP that I call with the error message. I can then place a single break point in the DOSTOP routine to catch any call, then step out to investigate the cause.

Jim Dempsey

Michael_Barkhudarov

on it!

Michael_Barkhudarov

Update so far.

Ran the solver on Linux through Allinea's DDT debugger. The traceback information at the point of hanging is in the image below.

Updated to the 16.0 version of Intel's compiler. The problem went away on Linux, but the solver still hangs on Windows. It looks like the problem may be in the dynamic libiomp5 library. I can't believe no one else is having such problems. Could someone please help with any pointers?

5.jpg

jimdempseyatthecove
Honored Contributor III

There is a separate thread on the forum exhibiting a hang situation. I posted a response there describing something similar from a long time ago, when I wrote my threading toolkit: pthread_cond_wait appeared to experience a race condition with pthread_cond_signal (or at least it seemed so). This happened rarely and only under a stress test. After a couple of weeks of trying to resolve it, I replaced the pthread_cond_wait with a pthread_cond_timedwait. Then, on the rare occasions when the race condition occurred, the timeout would fire and the code would retest the wait condition (resume if satisfied, timedwait again if not). In that case, and I suspect in yours, the root cause is not in OpenMP but in the underlying pthread library. Caution: this is only a supposition on my part. Nonetheless, the fact that pthread_cond_wait hung while pthread_cond_timedwait, followed by a timeout and a retest of the condition, succeeded (not necessarily via the condition variable, but via an independent variable set to indicate that pthread_cond_signal had been called) was a very strong indication that something is amiss within the pthread library.

I know that this does not solve your issue. Perhaps this information will provide a lead (or red herring) for the Intel support folks.

In your screenshot you show the stack of the main thread. Where are all the other threads sitting (inside the parallel region, or outside in an idle-thread holding spot)?

Also, your screenshot shows 25 threads. While one may be an internal OpenMP watchdog or helper thread, is there anything else we need to know (for example, are you running nested parallel regions, or is this region called from within an OMP task)?

Jim Dempsey

Michael_Barkhudarov

Thank you, Jim, I just found that thread. In fact, AGG on that thread works with me.

We are not running nested parallel regions and not calling this region from an OMP task. The simulation is running on 24 threads. I assume 0 refers to the main process and 1-24 to the spawned threads.

The run has since been terminated so I can't get more information. We are going to rerun it to reproduce the problem and query DDT more. We just started using DDT and have a meeting with the Allinea guys today to give us more training. I should be able to provide more information later today or tomorrow.

 

Michael_Barkhudarov

Jim, here is the stack output from threads 0, 10, and 24. The output from threads 1-23 is very much the same. Thread 0 is the master thread, and thread 24 is some sort of system thread.

thread_0.png thread_10.png

Michael_Barkhudarov

thread_24.png

Martyn_C_Intel
Employee

Here's a summary of what I posted to the other thread:

I reran similar tests on two different Windows systems (Windows 7 and Windows Server 2012 R2). I was able to reproduce the hang with the version 15 compiler and RTL, but not with 16.0. This was true for both the C and Fortran test cases.

The developers confirmed that the fix that was made to the OpenMP RTL in 16.0 was made for both Linux and Windows. Incidentally, this fix should also be present in the 15.0 update 5 compiler that will be posted very soon.

Please can you tell us more about the environment in which you see a problem with the 16.0 compiler? E.g. exact version of Windows; VS version; static or dynamic linking (I think OpenMP is always linked dynamically on Windows); the OpenMP RTL version as printed out when you set KMP_VERSION=yes; what sort of system you are running on (sockets, cores, threads, microarchitecture, 64 bit)?  Better still if you can provide a test case to reproduce the problem.

jimdempseyatthecove
Honored Contributor III

Michael,

From your two samples, the threads are hanging at an OpenMP barrier. The possible causes are:

1. Not all threads of the team entered the barrier. In your examination of threads 0-23, were all at the same barrier? If not, a thread may be waiting at a different barrier (IOW you have multiple barriers in the region, and control flow permits only part of the team to reach a given barrier; this would show up as one or more team members stuck at a different barrier).

2. Some other activity presents a deadlock or race condition that inhibits one or more threads of the team from looping back to the barrier observed in the screenshots (this may show up as one or more threads running at the deadlock).

3. If one or more threads are outside the do kjsum loop, those threads iterated fewer times than the other threads of the team (meaning kprb+jprf and/or kemx+jemx did not produce the same number of iterations for all team members).

If all the threads are at the same barrier, then this may be the pthread_cond_wait issue.

Jim Dempsey

Michael_Barkhudarov

Thank you, Jim, we will examine the output from each thread in DDT. Just in case, here is my post from yesterday on Javier's thread:

 

Martyn, glad you were able to reproduce the problem. Unfortunately, I cannot easily create a test case for you since the FORTRAN code is massive. Could I try your test case instead?

The 15.0.4 compiler did not work on either Linux (see AGG's post above for the flavor) or Windows.

Setting KMP_VERSION=yes produces the expected result on the Linux machine, where the 16.0 version works, but does not give any additional output on my Windows machine. We are linking dynamically using the /MD and /Qopenmp options. When I type 'which libiomp5md.dll' it gives

C:/Program Files (x86)/IntelSWTools/compilers_and_libraries_2016.0.110/windows/redist/intel64/compiler/libiomp5md.dll

The Windows machine is running the 64-bit version of Windows 7 Professional SP1. OpenMP is linked dynamically. The version of libiomp5md.dll is 20150609 (obtained by running filever on it), the same as on the Linux machine where it works.

The processor is an i7-4930, 3.4 GHz, one socket, 6 real cores, hyperthreaded (the problem occurs on both 12 and 4 threads; I have not tried other combinations). Let me know if you need more info.

I also attach an image from the dependency walker to show the dynamic libs the executable (hydr3d.exe) depends on, including the openmp library.

 

dynlibs.jpg

Martyn_C_Intel
Employee

Michael,

15.0.4 contains the compiler library bug that Vladimir and I have referred to before. We need to focus on the behavior of the fixed library that is found with the 16.0 compiler. Here's a very simple Fortran reproducer:

program Test
  do i = 1, 1100000000
    call sub
    if(mod(i, 1000000).eq.0) print *, i
  enddo
end program Test

subroutine sub
!$omp parallel do
  do j = 1, 200 
  end do
end subroutine sub

Compile with: ifort /Qopenmp Test.f90

Here's a C version:

#include<stdio.h>

void Test()
{
     #pragma omp parallel for
     for(int j=0;j<200;j++)
     {
     }
}

int main()
{
 for (int i=0; i< 1100000000; i++)
 {
  Test();
  if(i%1000000==0) printf("i=%d\n",i);
 }
 return 0;
}

icl /Qopenmp Test.cpp

Michael_Barkhudarov

Thank you, Martyn, I will try it.

Jim, according to DDT, all threads 0-23 are at the SAME location.
