Hi,
I compiled my program with ifort (using ifort -parallel -O3). The program runs for a while, but the computations eventually stop and the program sits in state "S" (interruptible sleep) according to ps. Here is a listing of minimal code that reproduces the problem:
      program bkmit
      implicit double precision (a-h,o-z)
      parameter(nr=1000,dt=5d1,nt=400000000,no=100,o1=1d-9,o2=1d-7)
      real*8 p(3,nr)
      do io=1,no
        o=o1
        do ir=1,nr
          p(1,ir)=0d0
          p(2,ir)=0d0
        enddo
        do it=1,nt
          do ir=2,nr-1
            p(3,ir)=p(1,ir)+1d7*(p(2,ir+1)-p(2,ir-1))
          enddo
          p(3,1)=-p(3,2)
          p(3,nr)=p(3,nr-1)
        enddo
        write(*,*) o
      enddo
      end
Note that for nr=100 the code works fine. Any idea what causes the problem? Thank you very much.
With best wishes
Jiri
---
Jiri,
You should provide the version number of the compiler with problem reports. Also include the processor type, the number of processors, and any OMP_* and KMP_* environment variables: essentially everything Intel engineers need to reproduce the problem.
For your version of the compiler this is a bug; hanging is never an acceptable outcome.
There are two potential problems with your code and your expectations, neither of which should cause a hang:
a) The results in p(3,:) are not used outside the innermost two loops. The compiler optimization at -O3 may be smart enough to eliminate those two loops because they are not productive: they do not produce usable results. The elimination of the loops likely occurs in a later pass than the attempt to parallelize the inner loop (or the collapsed inner two loops). This may produce a parallel region with nothing to do.
b) At nr=100 the innermost loop may fall below the threshold for parallelization; that may account for the program running with 100. For the code inside the innermost loop, even an iteration count of 1000 is likely too small to benefit from parallelization. In any case, the code shouldn't hang, merely run inefficiently.
The compiler optimization may even be smart enough to realize that p(1,:) and p(2,:) are initialized to 0.0 and reduce the statement in the innermost loop to p(3,ir) = 0.0, then observe that p(3,:) is not used outside the second-innermost loop, and remove that too.
When testing optimized code, it is recommended that there be at least the appearance that the code is producing results that are used. Something like:
      program bkmit
      implicit double precision (a-h,o-z)
      parameter(nr=1000,dt=5d1,nt=400000000,no=100,o1=1d-9,o2=1d-7)
      real*8 p(3,nr)
      do io=1,no
        o=o1
        do ir=1,nr
          p(1,ir)=io + ir   ! obfuscate the test data
          p(2,ir)=io - ir   ! obfuscate the test data
          p(3,ir)=0.0       ! assure p(3,1) and p(3,nr) are initialized
        enddo
        do it=1,nt
          do ir=2,nr-1
            p(3,ir)=p(1,ir)+1d7*(p(2,ir+1)-p(2,ir-1))
          enddo
          p(3,1)=p(3,1) - p(3,2)      ! assure p(3,1) result per iteration used
          p(3,nr)=p(3,nr) + p(3,nr-1) ! assure p(3,nr) result per iteration used
        enddo
        if(ISNAN(p(3,1))) write(*,*)
     &    "this should not print while using result in p()"
        write(*,*) o
      enddo
      end program bkmit
Jim Dempsey
---
Jim,
Thank you for your answer. The version of ifort is 15.0.1 20141023. The processor is an Intel(R) Core(TM) i7-3930K CPU @ 3.20GHz (6 processors). The only environment variable set (apart from the library paths) is OMP_NUM_THREADS=3.
The code I posted is just a minimal version of a bigger code, and it still reproduces the problem. The full code worked fine on my older computer.
I also tried your code, and it exhibits the same problem.
With best wishes
Jiri
---
What I see is that with nr=100 the program makes progress and the CPU cores are not saturated. With nr=1000, all the CPU cores go to 100% and no progress is made. I will send this on to the developers.
---
Looking at the optimization reports, I see that when NR=100 the compiler chooses not to parallelize the inner loop because of "insufficient work", but for NR=1000 it does. Regarding the "sleep" state, my guess is that the OS noticed the program was taking all of the cores and decided to "put it to sleep", though this is just a guess. The issue ID is DPD200364547.
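(If you want to inspect the auto-parallelizer's decisions yourself, you can request an optimization report; assuming a 15.x compiler and a source file named bkmit.f, one way is:

ifort -parallel -O3 -qopt-report=5 -qopt-report-phase=par bkmit.f

The report is then written to bkmit.optrpt by default.)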
---
The developers tell me that the problem here is only that the inner loop doesn't benefit from parallelization and you'll actually get better performance by disabling parallelization for that loop (by prefacing it with !DIR$ NOPARALLEL). It isn't hung; it's just taking much more CPU time to do the work than it would otherwise. The behavior then triggers the OS to sleep the process.
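For reference, a sketch of where the directive would go in the original reproducer; !DIR$ NOPARALLEL applies to the DO loop that immediately follows it, so the rest of the program remains eligible for auto-parallelization:

      program bkmit
      implicit double precision (a-h,o-z)
      parameter(nr=1000,dt=5d1,nt=400000000,no=100,o1=1d-9,o2=1d-7)
      real*8 p(3,nr)
      do io=1,no
        o=o1
        do ir=1,nr
          p(1,ir)=0d0
          p(2,ir)=0d0
        enddo
        do it=1,nt
!DIR$ NOPARALLEL
          do ir=2,nr-1
            p(3,ir)=p(1,ir)+1d7*(p(2,ir+1)-p(2,ir-1))
          enddo
          p(3,1)=-p(3,2)
          p(3,nr)=p(3,nr-1)
        enddo
        write(*,*) o
      enddo
      end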
---
>>The behavior then triggers the OS to sleep the process.
Why would the O/S sleep an otherwise compute-bound process?
What I assume is happening, as you stated ("the inner loop doesn't benefit from parallelization and you'll actually get better performance by disabling parallelization"), is that the overhead of creating the parallel region for the innermost loop is 1-2 orders of magnitude larger than the loop overhead had it run serially. The parallel version shouldn't have hung, but it would take much longer.
Try setting nt to a smaller number such that the inefficient parallel version takes a reasonable amount of time, then compare it against the serial version. Use omp_get_wtime() to get a reasonably accurate runtime.
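Something along these lines, with nt cut down to an arbitrary smaller value; both builds need the OpenMP runtime linked (e.g. -parallel for the parallel build, -qopenmp for the serial-loop comparison) so that omp_get_wtime() resolves:

      program timeit
      use omp_lib                     ! for omp_get_wtime()
      implicit double precision (a-h,o-z)
      parameter(nr=1000,nt=1000000)   ! nt reduced for a reasonable runtime
      real*8 p(3,nr)
      do ir=1,nr
        p(1,ir)=ir                    ! obfuscated test data, as before
        p(2,ir)=-ir
        p(3,ir)=0d0
      enddo
      t0=omp_get_wtime()
      do it=1,nt
        do ir=2,nr-1
          p(3,ir)=p(1,ir)+1d7*(p(2,ir+1)-p(2,ir-1))
        enddo
        p(3,1)=p(3,1)-p(3,2)          ! keep the results "used"
        p(3,nr)=p(3,nr)+p(3,nr-1)
      enddo
      t1=omp_get_wtime()
      write(*,*) 'elapsed seconds:',t1-t0
      write(*,*) p(3,1),p(3,nr)       ! print results so they are not dead
      end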
Jim Dempsey
---
This seems to be true; the code does not benefit much from the parallelization. The speedup of the original code with parallelization is about 1.5 (on three processors).
Jiri
---
Your original test program had:
      do it=1,nt
        do ir=2,nr-1
          p(3,ir)=p(1,ir)+1d7*(p(2,ir+1)-p(2,ir-1))
        enddo
        p(3,1)=-p(3,2)
        p(3,nr)=p(3,nr-1)
      enddo
This presumably represents your actual program in abstract form, where the inner loop is executed many times, each pass producing a small set of results:
      do iWork=1,nWorkItems
        do ir=2,nr-1
          p(3,ir)=p(1,ir)+1d7*(p(2,ir+1)-p(2,ir-1))
        enddo
        p(3,1)=-p(3,2)
        p(3,nr)=p(3,nr-1)
        call doSomethingWithResults()
      enddo
In this case, see if your problem can be structured such that the parallelization is made at the do iWork= level; this can be done with OpenMP directives, as in the sketch below. You will have to look at your problem, at how you manage your input data, and at how you handle your finished work data, to see whether you can parallelize at the outer loop level.
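A sketch of that idea, assuming the work items are independent; nWorkItems, the per-item initialization, and the result handling below are all hypothetical stand-ins for whatever the real program does. Compile with -qopenmp:

      program outerpar
      implicit double precision (a-h,o-z)
      parameter(nr=1000,nt=10000,nWorkItems=100)
      real*8 p(3,nr),result(nWorkItems)
!$omp parallel do private(p,it,ir)
      do iWork=1,nWorkItems
        do ir=1,nr                    ! hypothetical per-item input
          p(1,ir)=iWork+ir
          p(2,ir)=iWork-ir
          p(3,ir)=0d0
        enddo
        do it=1,nt
          do ir=2,nr-1
            p(3,ir)=p(1,ir)+1d7*(p(2,ir+1)-p(2,ir-1))
          enddo
          p(3,1)=-p(3,2)
          p(3,nr)=p(3,nr-1)
        enddo
        result(iWork)=p(3,1)          ! stand-in for doSomethingWithResults()
      enddo
!$omp end parallel do
      write(*,*) result(1),result(nWorkItems)
      end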
Jim Dempsey
---
Thank you for your advice, but this does not help. The code cannot be parallelized at the iWork= level, because the output of one loop iteration is used in the next one (it is a wave simulation). It seems that I will have to live with the serial version.
Jiri Krticka
---
Wave simulations may benefit from pipelining the process. Assume the serial program looks like:
do until done
  calc 1
  calc 2
  ...
  calc n
end do
then you might be able to perform:
do until done as pipeline
  calc 1.1; calc 2.1; ...; calc n.1
  calc 1.2; calc 2.2; ...; calc n.2
  calc 1.3; calc 2.3; ...; calc n.3
  calc 1.4; calc 2.4; ...; calc n.4
end do
Essentially, you have one thread perform one of the calculation functions, then pass the data on to another thread/stage in the pipeline. Once the pass is made, the passing thread/stage can accept input from the prior stage, or from the next iteration, if/when data is ready.
If, for example, your wave is simulated with a large set of particles in a volume, you can partition the volume into minor volumes; once a minor volume advances, the next integration step for that volume may begin. You may need some consideration for the boundaries between minor volumes, but this can often be handled by processing the perimeter of each minor volume after you process the interiors of the minor volumes.
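A minimal sketch of the minor-volume idea in one dimension, using a generic second-order wave update rather than your actual scheme: !$omp do partitions the spatial loop into contiguous chunks (the minor volumes), the implicit barrier at the end of each worksharing construct keeps the chunks in lockstep, and the boundary points are processed once per step. Compile with -qopenmp:

      program waves
      implicit double precision (a-h,o-z)
      parameter(nr=100000,nt=1000,c=0.25d0)
      real*8 uold(nr),ucur(nr),unew(nr)
      do ir=1,nr                      ! hypothetical initial condition
        uold(ir)=exp(-(dble(ir-nr/2)/50d0)**2)
        ucur(ir)=uold(ir)
      enddo
!$omp parallel private(it,ir)
      do it=1,nt
!$omp do
        do ir=2,nr-1                  ! each thread updates one chunk
          unew(ir)=2d0*ucur(ir)-uold(ir)
     &            +c*(ucur(ir+1)-2d0*ucur(ir)+ucur(ir-1))
        enddo
!$omp end do
!$omp single
        unew(1)=unew(2)               ! boundaries, once per step
        unew(nr)=unew(nr-1)
!$omp end single
!$omp do
        do ir=1,nr                    ! rotate the time levels
          uold(ir)=ucur(ir)
          ucur(ir)=unew(ir)
        enddo
!$omp end do
      enddo
!$omp end parallel
      write(*,*) ucur(nr/2)
      end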
The trick (skill) in parallelizing this is to figure out which portions of the work can be performed concurrently and which cannot. By rearranging and re-sequencing code, you can often introduce opportunities for parallelization where there formerly were none.
Jim Dempsey
