Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

PARALLEL DO: problem with speed

MAURICE_J_
Novice

Hi,

This is my very first visit to this forum, and English is not my native language. I am 61, with long experience in parallel programming, but this is my first time with Fortran + OpenMP.

I have a DO loop inside a PARALLEL DO construct. The loop runs 1329 times. 1326 iterations run in under 0.002 s, but 5 of them take about 0.01 s and one takes 0.1 s, on a Windows 7 64-bit system. Those 6 slow iterations seem to occur 'at random', but near the beginning.

Why? Can I do something to avoid (prevent?) the 6 long iterations?

Thanks in advance

Jean in France

TimP
Honored Contributor III

Among the many possible measures to deal with such effects is the elimination of private data arrays. It's not unusual to require nested loops with both inner and outer loop counts greater than 1000 before OMP PARALLEL shows an advantage over SIMD vectorization alone.

jimdempseyatthecove
Honored Contributor III

The very first parallel region in an application with OpenMP incurs the additional overhead of creating the additional threads for its thread pool. For timing purposes you usually want to exclude this overhead (as an application usually lives much longer than a single instance of a parallel region). The usual practice is to add a parallel region at the front of the program that does some innocuous work (but work that cannot be deemed deletable by compiler optimizations).
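For example, a minimal warm-up sketch (the dummy reduction is only there so the compiler cannot remove the region):

REAL :: dummy
INTEGER :: i
dummy = 0.0
!$OMP PARALLEL DO REDUCTION(+:dummy)
DO i = 1, 1000
  dummy = dummy + SQRT(REAL(i))   ! innocuous work spread over the team
END DO
!$OMP END PARALLEL DO
IF (dummy < 0.0) PRINT *, dummy   ! never true, but keeps the loop from being deleted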

For a timed region (one after the initial region that constructs the thread team), the time it takes each thread to enter the region depends on the state of the additional threads at the moment the region is reached by the main thread. If the additional team members have recently (within 100 ms to 300 ms) finished a prior parallel region, they can be immediately available to enter the next parallel region. However, if a longer interval (longer than KMP_BLOCKTIME) has elapsed, then some of the additional threads may have been suspended. This results in a longer latency before the suspended threads begin working in the parallel region.

In the first iteration of a timed loop, threads may encounter a page fault the first time they touch a page of the process's virtual memory. For this reason, you typically do not include the first iteration in the timing. Note that if the timed loop contains memory allocations, the first few iterations may also encounter page faults; you may even have an unusual loop where each iteration acquires heap memory that has never been touched. In that situation, the overhead of page faults is expected behavior.
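As an illustration only (DoOneStep here is a placeholder for whatever work you are timing), one way to exclude that first pass from the statistics:

USE OMP_LIB
DOUBLE PRECISION :: t0, t1, total
INTEGER :: iter
total = 0.0D0
DO iter = 1, 101
  t0 = omp_get_wtime()
  CALL DoOneStep()                         ! placeholder for the timed (parallel) work
  t1 = omp_get_wtime()
  IF (iter > 1) total = total + (t1 - t0)  ! discard the first (page-fault / warm-up) pass
END DO
PRINT *, 'average time excluding first pass:', total / 100.0D0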

Lastly, the thread team, usually sized to the full capacity of the system, is not the only thing running on your system. On my system, while I type this response, Task Manager shows ~2150 threads in 140 processes. The CPU Usage gauge varies between 0% and 1%. If, during (or between) iterations, a thread of one of these other processes runs, then one or more of your process's threads will be suspended, and those threads will not be able to start running in that iteration. Note that a thread of your application can also be preempted during execution of the timed loop, which extends that thread's run time within that iteration.

The end of the PARALLEL DO has an implied barrier. Therefore, any delay in any thread starting the loop, as well as any delay from preemption of a thread executing the loop, affects all the team members of the loop. For testing purposes (e.g. determining scaling) you can statistically remove such outliers. Note, though, that this overhead/interference is to be expected in an actual application, so it is not advisable to quote only the best times when making general statements about performance improvements.

Jim Dempsey

 

MAURICE_J_
Novice

Thanks, Tim and Jim, for your quick answers.

Tim,

I must explain a little more: I have some arrays like jean(1329,20). I have a routine that populates row j of each array with some complicated calculations. In this routine I have the OMP PARALLEL and OMP END PARALLEL directives (so parallel DO loops populate the 'columns'). Each row is independent of the others (I mean, the calculations for row j do not use data from row j-1 or j+1, etc.). The only private variable I have is the variable k for the loop from 1 to 20. What I time (chronometer) is the duration of the whole routine.
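Roughly, the structure looks like this (a simplified sketch; POPULATE_ROW and the formula are only stand-ins for the real calculations):

SUBROUTINE POPULATE_ROW(jean, j)
  IMPLICIT NONE
  REAL, INTENT(INOUT) :: jean(1329,20)
  INTEGER, INTENT(IN) :: j
  INTEGER :: k
  !$OMP PARALLEL
  !$OMP DO PRIVATE(k)
  DO k = 1, 20
    jean(j,k) = SQRT(REAL(j)) * REAL(k)   ! stand-in for the complicated calculation
  END DO
  !$OMP END DO
  !$OMP END PARALLEL
END SUBROUTINE POPULATE_ROW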

Jim,

I am going to take an aspirin before translating your message! I have understood that writing a 'fake' parallel section in the initialization of the app can improve the response time of the parallel regions. I will test that tonight (the complete app takes more than 6 hours to run). Another point: I don't think I have more than 20 threads at the same time.

May I ask another question? As I said above, each row of the arrays is independent of the other rows; in fact, I can't have conflicts. Do I have to put the arrays in a SHARED clause? I suppose not, but I would like to be sure.

Jean

 

jimdempseyatthecove
Honored Contributor III

The default is SHARED. However, if you add DEFAULT(NONE), then you must explicitly state the data-sharing attribute of every variable and array used in the region.
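For example (a minimal sketch with a stand-in formula; your arrays can safely be SHARED because each iteration writes a distinct row):

REAL :: jean(1329,20)
INTEGER :: j, k
!$OMP PARALLEL DO DEFAULT(NONE) SHARED(jean) PRIVATE(k)
DO j = 1, 1329
  DO k = 1, 20
    jean(j,k) = REAL(j) + 0.01*REAL(k)   ! stand-in for the real calculation
  END DO
END DO
!$OMP END PARALLEL DO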

You state that the iterations of the PARALLEL DO take (on average) 0.002 s, and that the application takes 6 hours to complete.

What portion of the application is in this loop?
Have you run VTune? Where does it show the time is being spent?
What other portions of the program have you parallelized?

Note,

!$OMP PARALLEL DO
DO I=...
...
END DO
!$OMP END PARALLEL DO
... (SHORT TIME)
!$OMP PARALLEL DO
DO I=...
...
END DO
!$OMP END PARALLEL DO

------------ can be written as --------------

!$OMP PARALLEL
!$OMP DO
DO I=...
...
END DO
!$OMP END DO
!$OMP MASTER ! or possibly !$OMP SINGLE
... (SHORT TIME)
!$OMP END MASTER  ! or possibly !$OMP END SINGLE
!$OMP BARRIER ! may or may not be required (dependent on code)
!$OMP DO
DO I=...
...
END DO
!$OMP END DO
!$OMP END PARALLEL

The above reduces the number of entries to and exits from parallel regions.

Jim Dempsey

MAURICE_J_
Novice

First answer: the app was written by hydrological engineers who learnt Fortran from a book. I have been hired to optimize the whole app. I began by 'simplifying' the code and gained 30% in speed. Then I began to parallelize 'simple' DO loops, but with no wonderful results, perhaps because of this 'simplicity'. It's the first time I have tried to parallelize a DO loop whose 'body' is a CALL to a subroutine long enough to be 'measured' with SECNDS(). In other words, after having parallelized small parts, I am trying to work one level above (is that understandable?).

I want to understand what is happening before working on a lot of other parts of the app.

I am going to look at your code ...

Jean

BTW, I don't know if it is due to the parallelization, but last night's run gave a river flowing up the mountain!!

jimdempseyatthecove
Honored Contributor III

For optimization there is a simple phrase to keep in mind: Parallel Outer, Vector Inner.

Efforts should be made to address vectorization at the innermost levels of the code.
Then, efforts should be made to situate the entry into parallel regions at the highest practical level.
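A fragment illustrating the idea (a, b, c, d are assumed to be shared 2-D arrays):

!$OMP PARALLEL DO              ! parallel outer: independent columns spread over the threads
DO j = 1, n
  !$OMP SIMD                   ! vector inner: contiguous, stride-1 access in Fortran
  DO i = 1, m
    a(i,j) = b(i,j) * c(i,j) + d(i,j)
  END DO
END DO
!$OMP END PARALLEL DO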

In your initial post, you mention that a loop you parallelized involves 1326 iterations of ~0.002 seconds, i.e. about 2.652 seconds out of 6 hours of run time (assuming those 1326 iterations occur once in the 6 hours). This seems to indicate that your parallelization efforts started at the innermost level. Without seeing the entire application, it is hard to offer meaningful advice on where the parallelization should be applied.

Jim Dempsey

MAURICE_J_
Novice

Hi Jim,

On a side note, I discovered that the Intel compiler has an 'auto-parallelization' option that seems to work well, and I am studying it.
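If I read the documentation correctly, it is the /Qparallel option on Windows (-parallel on Linux), for example (the file name is just a placeholder):

ifort /Qparallel /O2 myprog.f90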

To come back to my loop: we are dealing with rivers. Each small river has a source, whose flow (quantity of water) depends on rain, snow, soil type, temperature, sun, and so on. Then small rivers merge to become a 'middle river', where you can find dams, lakes, farmers who take the water (but it comes back later), and so on. And all the 'middle rivers' become a great river that flows into the sea. I have 1329 sources that are truly independent, and I am trying to parallelize only that part, leaving the rivers, dams, and lakes 'as is' for now. So my loops are at the very beginning of the whole process.

Is that clearer?

Jean

jimdempseyatthecove
Honored Contributor III

Autoparallelization is typically suitable for only the simplest of problems.

If (when) each source calculation consumes approximately the same amount of time, then a PARALLEL DO with static scheduling is preferred. This is the default scheduling, clause SCHEDULE(STATIC); the default can be overridden with an environment variable. Static scheduling distributes iterations as evenly as possible among the available threads. As an example, on a 4-hardware-thread system your loop of 1329 iterations would be partitioned with three threads receiving 332 iterations and the remaining thread receiving 333 iterations.

When each source calculation differs but the relative computation is small, then a different scheduling scheme is generally better. Look in the compiler documentation (OpenMP section) under the DO directive. If you can organize your sources array in order of diminishing computational complexity, then consider SCHEDULE(GUIDED, beginningChunkSize). Experiment with beginningChunkSize = nIterations / nThreads / 2.

When each source calculation differs but the relative computation is large, then consider static scheduling with a chunk size of 1: SCHEDULE(STATIC,1).
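A sketch of the three variants applied to the sources loop (nSources, nThreads, and DoSourceCalc are placeholders; nThreads can be obtained from omp_get_max_threads()):

! Uniform work per source: default (static) scheduling, one even block per thread.
!$OMP PARALLEL DO SCHEDULE(STATIC)
DO i = 1, nSources
  CALL DoSourceCalc(i)
END DO
!$OMP END PARALLEL DO

! Non-uniform but relatively small work, sources ordered by decreasing cost: guided scheduling.
chunk = MAX(1, nSources / nThreads / 2)
!$OMP PARALLEL DO SCHEDULE(GUIDED, chunk)
DO i = 1, nSources
  CALL DoSourceCalc(i)
END DO
!$OMP END PARALLEL DO

! Non-uniform and relatively large work: round-robin static with chunk size 1.
!$OMP PARALLEL DO SCHEDULE(STATIC, 1)
DO i = 1, nSources
  CALL DoSourceCalc(i)
END DO
!$OMP END PARALLEL DO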

For the overall structure, consider:

nIterations = SimTimeInSeconds / SimStepInSeconds
...
!$OMP PARALLEL
nThreads = omp_get_num_threads() ! all threads overwrite global nThreads (in module)
! All threads in parallel integrate for the number of iterations in simulation
DO I=1,nIterations
  CALL SMALL_RIVER_ONE_STEP()
  CALL LARGER_RIVER_ONE_STEP()
  CALL LARGEST_RIVER_ONE_STEP()
  IF(ThisIterationNeedsAttention(I)) then
    !$OMP MASTER
    CALL MasterDoesItsThing()
    !$OMP END MASTER
    !$OMP BARRIER
  ENDIF
END DO
!$OMP END PARALLEL

! all threads call this routine
SUBROUTINE SMALL_RIVER_ONE_STEP()
  ...
  !$OMP DO ! You may want to experiment with schedule here
  DO I=1,nSmallRivers
    CALL DoRiverCalc(SmallRiver(I))
  END DO
  !$OMP END DO
  ! implied barrier at end of OMP DO
  ! *** still in parallel region, but outside of OMP DO
END SUBROUTINE SMALL_RIVER_ONE_STEP


! all threads call this routine
SUBROUTINE LARGER_RIVER_ONE_STEP()
  ...
  !$OMP DO ! You may want to experiment with schedule here
  DO I=1,nLargerRivers
    CALL DoRiverCalc(LargerRiver(I))
  END DO
  !$OMP END DO
  ! implied barrier at end of OMP DO
  ! *** still in parallel region, but outside of OMP DO
END SUBROUTINE LARGER_RIVER_ONE_STEP
...

Note, you probably have an undetermined number of levels in the contributing-rivers structure. For this you can have an array of river levels.

An alternative is to use OpenMP tasking in a recursive manner; this may incur additional overhead, but the overhead may be acceptable:

!$OMP PARALLEL
!$OMP MASTER
CALL River(RootRiver) ! e.g. Mississippi
!$OMP END MASTER
!$OMP END PARALLEL
...

RECURSIVE SUBROUTINE River(aRiver)
  USE MOD_River
  TYPE(River_T) :: aRiver
  INTEGER :: I
  DO I=1,aRiver%nRivers
    !$OMP TASK
    CALL River(aRiver%Rivers(I)) ! recursively spawn a task for each contributing river
    !$OMP END TASK
  END DO
  !$OMP TASKWAIT ! wait for all contributing rivers to reach this time state
  ! additional work for aggregation
  ...
END SUBROUTINE River

It is usually a tradeoff between ease of programming and cost of execution.

Jim Dempsey

TimP
Honored Contributor III

Ifort auto-parallelization has made great improvements lately, and it can handle the simplest cases of workload balancing on a single CPU. As Jim pointed out, the slogan of concurrent (threaded) outer loop, vector inner loop retains its validity after 25 years and is rarely discovered by auto-parallel compilation. Still, it may save you some effort in your initial attempts at parallel optimization.

In my recent webinar, I pointed out some small cases where a source-code optimization can unleash the power of auto-parallel compilation, and some where more intensive work with OpenMP is needed.

MAURICE_J_
Novice

I thank you two for your help. I have a lot to do to apply all your suggestions.

I don't know when the app will be able to deal with the Mississippi! My test data are for a 240 km long river! And I have time to listen to 'Old Man River' while waiting for the app to complete!

Thanks again.

Jean

jimdempseyatthecove
Honored Contributor III

One of the most useful techniques I employ is to (attempt to) look at the problem as a whole to determine what can occur concurrently. This is in opposition to looking at profile data for hot spots and then addressing those in decreasing order of 'hotness'. While considering the problem as a whole, it is also helpful at times to ascertain whether parts of the program can operate in different phases (adjacent time steps). Doing so may expose opportunities for pipelining.

Not seeing your program, I can only guess at how you may have addressed the problem. Assume you have tiny tributaries contributing to larger tributaries, contributing to still larger tributaries, ... contributing to the final part of the river. A simplified (elegant) mathematical solution might carry the contributions from each of the tiniest tributaries all the way through to the final part of the river. Essentially, the mathematician expresses the problem in the most concise manner, without regard to the computational requirements (computations are free). It is the programmer's objective to take that expression and decompose it, then recompose it, in a manner that, while following the rules/design of the expression, is not necessarily sequenced as implied (expressed) in the original equation. The results of the program must agree with the abstract mathematical expression.

In your simulation, it may be the case that you can structure the code as in the recursive example. Each junction of a tributary and the next larger part of the river is essentially the same computation. An observation that can be made from the recursive example is that the "! additional work for aggregation" has to wait for all tributaries to reach the current time state. This is accomplished by recursively calling until you reach the final leaf of the tributary branches, then working your way back down. This technique, while easy to express, is not necessarily the most efficient means of computation: it offers opportunities for parallelization starting with 1 thread, then 2 threads, then 4 threads, ... n threads, n threads, n threads, ... n/2 threads, n/4 threads, ... 4 threads, 2 threads, 1 thread. In the center section you have good parallelization, but at the ends it is weak.

A better technique might be (and I stress might) to invert the tree, compute each branch's level (distance) from the final trunk, and begin computation at the furthest branches; this first step may or may not involve all the threads available to the process. When those extreme branches complete, they advance to the next time step, while their output is contributed to the current time step (at the next lower level).

Effectively, what you construct is a parallel pipeline with multiple stages, each stage operating on a different time step. From the observer's viewpoint (the reporting portion of the program), the simulation does not begin until the first data comes out of the pipeline. What is important is the throughput of the pipeline, i.e. its ability to keep all hardware threads busy.

Jim Dempsey
