Hi,
I compiled my program with ifort (using ifort -parallel -O3). The program runs for a while, but the computations eventually stop and the program sits in state "S" (interruptible sleep) according to ps. Here is a listing of minimal code that reproduces the problem:
      program bkmit
      implicit double precision (a-h,o-z)
      parameter(nr=1000,dt=5d1,nt=400000000,no=100,o1=1d-9,o2=1d-7)
      real*8 p(3,nr)
      do io=1,no
        o=o1
        do ir=1,nr
          p(1,ir)=0d0
          p(2,ir)=0d0
        enddo
        do it=1,nt
          do ir=2,nr-1
            p(3,ir)=p(1,ir)+1d7*(p(2,ir+1)-p(2,ir-1))
          enddo
          p(3,1)=-p(3,2)
          p(3,nr)=p(3,nr-1)
        enddo
        write(*,*) o
      enddo
      end
Note that for nr=100 the code works fine. Any idea what causes the problem? Thank you very much.
With best wishes
Jiri
---
Jiri,
You should provide the version number of the compiler with problem reports. Also include the processor type, the number of processors, and any OMP_* and KMP_* environment variables: essentially everything Intel engineers need to reproduce the problem.
For your version of the compiler this is a bug; hanging is never an acceptable outcome.
There are two potential problems with your code and your expectations, neither of which should cause a hang:
a) The results in p(3,:) are not used outside the innermost two loops. The compiler optimization at -O3 may be smart enough to eliminate those two loops because they are not productive: they do not produce usable results. The elimination of the loops likely occurs in a later pass than the attempt to parallelize the inner loop (or the collapsed inner two loops). This may produce a parallel region with nothing to do.
b) At nr=100 the innermost loop may fall below the threshold for parallelization; that may account for the program running with 100. For the code inside the innermost loop, even an iteration count of 1000 is likely too small to benefit from parallelization. In any case, the code shouldn't hang, merely run inefficiently.
The compiler optimization may even be smart enough to realize that p(1,:) and p(2,:) are initialized to 0.0 and reduce the statement in the innermost loop to p(3,ir) = 0.0, then observe that p(3,:) is not used outside the second-innermost loop, and remove that too.
When testing optimized code, it is recommended that there be at least the appearance that the code is producing results that are used. Something like:
      program bkmit
      implicit double precision (a-h,o-z)
      parameter(nr=1000,dt=5d1,nt=400000000,no=100,o1=1d-9,o2=1d-7)
      real*8 p(3,nr)
      do io=1,no
        o=o1
        do ir=1,nr
          p(1,ir)=io + ir   ! obfuscate the test data
          p(2,ir)=io - ir   ! obfuscate the test data
          p(3,ir)=0.0       ! assure p(3,1) and p(3,nr) are initialized
        enddo
        do it=1,nt
          do ir=2,nr-1
            p(3,ir)=p(1,ir)+1d7*(p(2,ir+1)-p(2,ir-1))
          enddo
          p(3,1)=p(3,1) - p(3,2)      ! assure p(3,1) result per iteration used
          p(3,nr)=p(3,nr) + p(3,nr-1) ! assure p(3,nr) result per iteration used
        enddo
        if(ISNAN(p(3,1))) write(*,*)
     &    "this should not print while using result in p()"
        write(*,*) o
      enddo
      end program bkmit
Jim Dempsey
---
Jim,
Thank you for your answer. The version of ifort is 15.0.1 20141023. The processor is an Intel(R) Core(TM) i7-3930K CPU @ 3.20GHz (6 processors). The only environment variable set (apart from the library paths) is OMP_NUM_THREADS=3.
The code I posted is just a minimal version of a bigger code, and it still reproduces the problem. The full code worked fine on my older computer.
I also tried your code, and it exhibits the same problem.
With best wishes
Jiri
---
What I see is that with nr=100 the program makes progress and the CPU cores are not saturated. With nr=1000, all the CPU cores go to 100% and no progress is made. I will send this on to the developers.
---
Looking at the optimization reports, I see that when NR=100 the compiler chooses not to parallelize the inner loop because of "insufficient work", but for NR=1000 it does. Regarding the "sleep" state, my guess is that the OS noticed the program was taking all of the cores and decided to "put it to sleep", though this is just a guess. The issue ID is DPD200364547.
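(If you want to inspect the auto-parallelizer's decisions yourself, you can request an optimization report; assuming a 15.x compiler and a source file named bkmit.f, one way is:

ifort -parallel -O3 -qopt-report=5 -qopt-report-phase=par bkmit.f

The report is then written to bkmit.optrpt by default.)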
---
The developers tell me that the problem here is only that the inner loop doesn't benefit from parallelization and you'll actually get better performance by disabling parallelization for that loop (by prefacing it with !DIR$ NOPARALLEL). It isn't hung; it's just taking much more CPU time to do the work than it would otherwise. The behavior then triggers the OS to sleep the process.
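For reference, a sketch of where the directive would go in the original reproducer; !DIR$ NOPARALLEL applies to the DO loop that immediately follows it, so the rest of the program remains eligible for auto-parallelization:

      program bkmit
      implicit double precision (a-h,o-z)
      parameter(nr=1000,dt=5d1,nt=400000000,no=100,o1=1d-9,o2=1d-7)
      real*8 p(3,nr)
      do io=1,no
        o=o1
        do ir=1,nr
          p(1,ir)=0d0
          p(2,ir)=0d0
        enddo
        do it=1,nt
!DIR$ NOPARALLEL
          do ir=2,nr-1
            p(3,ir)=p(1,ir)+1d7*(p(2,ir+1)-p(2,ir-1))
          enddo
          p(3,1)=-p(3,2)
          p(3,nr)=p(3,nr-1)
        enddo
        write(*,*) o
      enddo
      end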
---
>>The behavior then triggers the OS to sleep the process.
Why would the O/S sleep an otherwise compute-bound process?
What I assume is happening, as you stated ("the inner loop doesn't benefit from parallelization and you'll actually get better performance by disabling parallelization"), is that the overhead of creating the parallel region for the innermost loop is 1-2 orders of magnitude larger than the loop overhead had it run serially. The parallel version shouldn't have hung, but it would take much longer.
Try setting nt to a smaller number such that the inefficient parallel version takes a reasonable amount of time, then compare it against the serial version. Use omp_get_wtime() to get a reasonably accurate runtime.
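Something along these lines, with nt cut down to an arbitrary smaller value; both builds need the OpenMP runtime linked (e.g. -parallel for the parallel build, -qopenmp for the serial-loop comparison) so that omp_get_wtime() resolves:

      program timeit
      use omp_lib                     ! for omp_get_wtime()
      implicit double precision (a-h,o-z)
      parameter(nr=1000,nt=1000000)   ! nt reduced for a reasonable runtime
      real*8 p(3,nr)
      do ir=1,nr
        p(1,ir)=ir                    ! obfuscated test data, as before
        p(2,ir)=-ir
        p(3,ir)=0d0
      enddo
      t0=omp_get_wtime()
      do it=1,nt
        do ir=2,nr-1
          p(3,ir)=p(1,ir)+1d7*(p(2,ir+1)-p(2,ir-1))
        enddo
        p(3,1)=p(3,1)-p(3,2)          ! keep the results "used"
        p(3,nr)=p(3,nr)+p(3,nr-1)
      enddo
      t1=omp_get_wtime()
      write(*,*) 'elapsed seconds:',t1-t0
      write(*,*) p(3,1),p(3,nr)       ! print results so they are not dead
      end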
Jim Dempsey
---
This seems to be true; the code does not benefit much from the parallelization. The speedup of the original code with parallelization is about 1.5 (on three processors).
Jiri
---
Your original test program had:
      do it=1,nt
        do ir=2,nr-1
          p(3,ir)=p(1,ir)+1d7*(p(2,ir+1)-p(2,ir-1))
        enddo
        p(3,1)=-p(3,2)
        p(3,nr)=p(3,nr-1)
      enddo
This presumably represents your actual program in abstract form, where the inner loop is executed many times, each pass producing a small set of results:
      do iWork=1,nWorkItems
        do ir=2,nr-1
          p(3,ir)=p(1,ir)+1d7*(p(2,ir+1)-p(2,ir-1))
        enddo
        p(3,1)=-p(3,2)
        p(3,nr)=p(3,nr-1)
        call doSomethingWithResults()
      enddo
In this case, see if your problem can be structured such that the parallelization is made at the do iWork= level; this can be done with OpenMP directives, as in the sketch below. You will have to look at your problem, at how you manage your input data, and at how you handle your finished work data, to see whether you can parallelize at the outer loop level.
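A sketch of that idea, assuming the work items are independent; nWorkItems, the per-item initialization, and the result handling below are all hypothetical stand-ins for whatever the real program does. Compile with -qopenmp:

      program outerpar
      implicit double precision (a-h,o-z)
      parameter(nr=1000,nt=10000,nWorkItems=100)
      real*8 p(3,nr),result(nWorkItems)
!$omp parallel do private(p,it,ir)
      do iWork=1,nWorkItems
        do ir=1,nr                    ! hypothetical per-item input
          p(1,ir)=iWork+ir
          p(2,ir)=iWork-ir
          p(3,ir)=0d0
        enddo
        do it=1,nt
          do ir=2,nr-1
            p(3,ir)=p(1,ir)+1d7*(p(2,ir+1)-p(2,ir-1))
          enddo
          p(3,1)=-p(3,2)
          p(3,nr)=p(3,nr-1)
        enddo
        result(iWork)=p(3,1)          ! stand-in for doSomethingWithResults()
      enddo
!$omp end parallel do
      write(*,*) result(1),result(nWorkItems)
      end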
Jim Dempsey
---
Thank you for your advice, but this does not help. The code cannot be parallelized at the iWork= level, because the output of one loop iteration is used in the next one (it is a wave simulation). It seems that I will have to live with the serial version.
Jiri Krticka
---
Wave simulations may benefit from pipelining the process. Assume the serial program looks like:
do until done
  calc 1
  calc 2
  ...
  calc n
end do
then you might be able to perform:
do until done as pipeline
  calc 1.1; calc 2.1; ...; calc n.1
  calc 1.2; calc 2.2; ...; calc n.2
  calc 1.3; calc 2.3; ...; calc n.3
  calc 1.4; calc 2.4; ...; calc n.4
end do
Essentially, you have one thread perform one of the calculation functions, then pass the data on to another thread/stage in the pipeline. Once the pass is made, the passing thread/stage can accept input from the prior stage, or from the next iteration, if/when data is ready.
If, for example, your wave is simulated with a large set of particles in a volume, you can partition the volume into minor volumes; once a minor volume advances, the next integration step for that volume may begin. You may need some consideration for the boundaries between minor volumes, but this can often be handled by processing the perimeter of each minor volume after you process the interiors of the minor volumes.
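A minimal sketch of the minor-volume idea in one dimension, using a generic second-order wave update rather than your actual scheme: !$omp do partitions the spatial loop into contiguous chunks (the minor volumes), the implicit barrier at the end of each worksharing construct keeps the chunks in lockstep, and the boundary points are processed once per step. Compile with -qopenmp:

      program waves
      implicit double precision (a-h,o-z)
      parameter(nr=100000,nt=1000,c=0.25d0)
      real*8 uold(nr),ucur(nr),unew(nr)
      do ir=1,nr                      ! hypothetical initial condition
        uold(ir)=exp(-(dble(ir-nr/2)/50d0)**2)
        ucur(ir)=uold(ir)
      enddo
!$omp parallel private(it,ir)
      do it=1,nt
!$omp do
        do ir=2,nr-1                  ! each thread updates one chunk
          unew(ir)=2d0*ucur(ir)-uold(ir)
     &            +c*(ucur(ir+1)-2d0*ucur(ir)+ucur(ir-1))
        enddo
!$omp end do
!$omp single
        unew(1)=unew(2)               ! boundaries, once per step
        unew(nr)=unew(nr-1)
!$omp end single
!$omp do
        do ir=1,nr                    ! rotate the time levels
          uold(ir)=ucur(ir)
          ucur(ir)=unew(ir)
        enddo
!$omp end do
      enddo
!$omp end parallel
      write(*,*) ucur(nr/2)
      end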
The trick (skill) in parallelizing this is to figure out which portions of the work can be performed concurrently and which cannot. By rearranging and re-sequencing code, you can often introduce opportunities for parallelization where there formerly were none.
Jim Dempsey
