I have 3 tasks which are totally independent from one another and are therefore good candidates for parallel execution:
Task 1: Execute the (single-threaded) subroutine called subA().
Task 2: Execute the (single-threaded) subroutine called subB().
Task 3: Populate an array within a DO loop. Each iteration of the DO loop is independent of all the others.
Suppose I have 8 threads. I'd like thread 0 to work on Task 1, thread 1 to work on Task 2, and threads 2-7 to work on Task 3. In Fortran, I imagine something like this:
COMPLEX*8, EXTERNAL :: func
!$OMP PARALLEL
!$OMP SECTIONS
!$OMP SECTION
!
! Task 1, performed by one thread
!
CALL subA()
!$OMP SECTION
!
! Task 2, performed by one thread
!
CALL subB()
!$OMP END SECTIONS NOWAIT
!$OMP DO
!
! Task 3, performed by all threads
!
DO j=1,nn
   vals(j) = func(j)
END DO
!$OMP END DO NOWAIT
!$OMP END PARALLEL
But the above code is not quite what I want. The threads that work on tasks 1 and 2 are also scheduled to work on the DO loop in task 3, which seems to slow everything down, presumably because those 2 threads are "late" arriving at the DO loop and therefore all the other threads must wait for them at the implicit barrier at the end of the PARALLEL region.
What is the proper way to handle the thread scheduling in a case like this?
(At the risk of providing too much information, I already know that subA() and subB() are compute-intensive, while each evaluation of func(j) is comparatively fast. It takes roughly as long for each of subA() and subB() to complete as it does for the entire DO loop to complete when several threads are assigned to the latter task.)
Thank you in advance,
ETA: Tim P. pointed out that my original question was ambiguous: it was not clear whether my DO loop was simply a memcpy() from func(1..nn) to vals(1..nn) or if I was calling a function called func() nn times. The latter was my intent and I clarified this in the example code.
Intel experts on OpenMP have suggested asking general OpenMP questions on stackoverflow, appropriately tagged. One or the other of the Intel Fortran forums would also be appropriate if an answer from an ifort expert is wanted.
I guess you mean that the DO loop is independent of the parallel sections. You might try simply adding a num_threads clause to it, setting num_threads to 2 less than the number of threads available (according to omp_get_num_threads) at the start of the parallel. I'm assuming, if your application wants 1 thread per core in the presence of HT, that you arrange that with OMP_PLACES=cores and omp_get_num_places or by setting OMP_NUM_THREADS to number of cores.
You could instead experiment with schedule(dynamic,2) and the like if you want to pick up those other 2 threads once they become available. I suppose they will stall on starting their first chunk until their section completes, but that won't be as serious with small chunks. This might be simpler if you expect to get some advantage from HyperThreading; then your DO loop could be doing some work at a less than normal rate on the threads which are sharing cores with the sections.
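As a rough illustration of the schedule(dynamic,2) idea, here is a minimal, self-contained sketch. The routines work_a, work_b and f are hypothetical stand-ins for subA(), subB() and func(), and nn = 1000 is an arbitrary size; none of them come from the original code. Whether this actually helps will depend on the real cost of func() relative to the scheduling overhead.

```fortran
! Sketch only. work_a, work_b and f are hypothetical stand-ins for
! subA(), subB() and func(); nn = 1000 is an arbitrary size.
program sections_then_dynamic
  implicit none
  integer, parameter :: nn = 1000
  real :: vals(nn), a, b
  integer :: j

  !$omp parallel
  !$omp sections
  !$omp section
  call work_a(a)          ! Task 1, one thread
  !$omp section
  call work_b(b)          ! Task 2, one thread
  !$omp end sections nowait

  ! Task 3: chunks of 2 iterations are handed out on demand, so a
  ! thread still busy in a section takes loop work only once it is free.
  !$omp do schedule(dynamic, 2)
  do j = 1, nn
     vals(j) = f(j)
  end do
  !$omp end do
  !$omp end parallel

  print *, 'a =', a, ' b =', b, ' sum(vals) =', sum(vals)

contains

  subroutine work_a(r)
    real, intent(out) :: r
    r = 1.0
  end subroutine work_a

  subroutine work_b(r)
    real, intent(out) :: r
    r = 2.0
  end subroutine work_b

  pure real function f(i)
    integer, intent(in) :: i
    f = 0.5 * real(i)
  end function f

end program sections_then_dynamic
```

The chunk size (2 here) is the knob to experiment with: larger chunks reduce scheduling overhead but leave less work for threads that arrive late.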
My advice about schedule(dynamic,2) works for a case of my own, on the DO loop which runs by taking up threads as they leave the earlier nowait.
Yet on another similar case it's no good (runs correctly but slow). I'm testing on 4 cores HT disabled. It may have more cache locality problems on dual CPU.
nowait seems frequently to give rise to non-repeatable performance.
num_threads clause is not permitted at the level I had in mind, so that suggestion appears to be out. Apparently, it has to apply to the entire parallel region.
Thank you for your help. I followed your advice and started a similar thread on stackoverflow: https://stackoverflow.com/questions/37949734/omp-sections-and-do-in-the-same-parallel-block
As it happens, I had tried your proposed solutions (adding NUM_THREADS(N-2) to the DO loop and using DYNAMIC scheduling) prior to posting my question. I apologize for not including that information in my original post.
Like you, I had also observed that the NUM_THREADS clause is not accepted by the "OMP DO" construct, which is too bad because that is precisely what I hope to accomplish!
On the other hand, dynamic scheduling works correctly but the result is -- in my case -- unsatisfying, because the DYNAMIC scheduling seems to add a nontrivial amount of overhead, and therefore increases execution time of the DO loop by an unpleasant amount.
I just discovered that adding a schedule clause in my case with AVX2 usually prevents entering the simd vector branch at run time. I haven't found a reliable solution, so am back to omitting the schedule clause. This problem shouldn't arise with nested loops, outer parallel, inner simd.
Larger chunk values, e.g. dynamic,4 sometimes worked, but not consistently.
You might try this:
!$OMP PARALLEL
if(omp_get_thread_num() .eq. 0) CALL subA()
if((omp_get_num_threads() .eq. 1) .OR. (omp_get_thread_num() .eq. 1)) CALL subB()
!$OMP DO SCHEDULE(DYNAMIC)
!
! Task 3, performed by all threads
!
DO j=1,nn
   vals(j) = func(j)
END DO
!$OMP END DO
!$OMP END PARALLEL
Note, experiment with a starting chunk size on the dynamic schedule.
If that DO loop does no more than what is shown, its performance is probably limited by the number of memory controllers in use. It may be worthwhile to test the loop inside an OMP SINGLE region.
If it doesn't automatically convert to a memcpy, or switch to streaming stores under -qopt-streaming-stores auto, according to your comments it should be big enough to set !dir$ vector nontemporal.
Tim P.: As Mr. Dempsey suspected, func() is a function and not an array. I revised my post to clarify this. Each call to func() requires a nontrivial amount of computation, but that computation is substantially less than the computation involved in subA() and subB(). Hence, there is a measurable benefit to multithreading the DO loop.
Mr. Dempsey: Your proposed solution involving dynamic scheduling works. However, I ultimately obtained the best performance by manually splitting up the DO loop based on the number of available threads, as I described more fully on stackoverflow. I expect that the performance benefits of my manual solution vs. the dynamic scheduling depend on the particular problem and must be evaluated on a case-by-case basis.
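The exact code is in the Stack Overflow thread and is not reproduced here. For readers of this thread, one way such a manual split might look is sketched below; it is an assumption about the general approach, not the posted solution. The names work_a, work_b and f are hypothetical stand-ins for subA(), subB() and func(), and the variables first, nw, w, lo and hi are introduced only for illustration. Threads 0 and 1 take the two subroutines, and the remaining threads divide the iterations into contiguous blocks by thread number.

```fortran
! Sketch only. work_a, work_b and f are hypothetical stand-ins for
! subA(), subB() and func(); the block-splitting scheme is an assumption.
program manual_split
  !$ use omp_lib
  implicit none
  integer, parameter :: nn = 1000
  real :: vals(nn), a, b
  integer :: tid, nt, first, nw, w, lo, hi, j

  vals = 0.0

  !$omp parallel private(tid, nt, first, nw, w, lo, hi, j)
  ! Serial defaults; the "!$" lines are compiled only when OpenMP is enabled.
  tid = 0
  nt = 1
  !$ tid = omp_get_thread_num()
  !$ nt = omp_get_num_threads()

  ! Tasks 1 and 2 go to threads 0 and 1 (or both to thread 0 if alone).
  if (tid == 0) call work_a(a)
  if (tid == min(1, nt - 1)) call work_b(b)

  ! Task 3: threads first..nt-1 each take one contiguous block of 1..nn.
  first = min(2, nt - 1)      ! first thread that works on the loop
  nw = nt - first             ! number of loop workers (at least 1)
  if (tid >= first) then
     w = tid - first          ! worker index, 0..nw-1
     lo = (w * nn) / nw + 1
     hi = ((w + 1) * nn) / nw
     do j = lo, hi
        vals(j) = f(j)
     end do
  end if
  !$omp end parallel

  print *, 'a =', a, ' b =', b
  print *, 'entries filled:', count(vals > 0.0), 'of', nn

contains

  subroutine work_a(r)
    real, intent(out) :: r
    r = 1.0
  end subroutine work_a

  subroutine work_b(r)
    real, intent(out) :: r
    r = 2.0
  end subroutine work_b

  pure real function f(i)
    integer, intent(in) :: i
    f = 0.5 * real(i)
  end function f

end program manual_split
```

With fewer than three threads the sketch degrades gracefully: the last thread runs its subroutine first and then the whole loop. A static split like this avoids scheduling overhead but does no load balancing, so it suits cases where each func(j) call costs about the same.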
Thanks to both of you for your assistance.
Your Stack Overflow solution may become non-optimal as nn grows (doubles, quadruples, and so on).
Also, each OpenMP task carries overhead, so it can pay to put more than one iteration in each task. Experiment with something like this:
DO j=1,nn,2 ! or 3, ...
!$omp task
   vals(j) = func(j)
   if(j+1 <= nn) vals(j+1) = func(j+1)
!$omp end task
END DO
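For context, a task loop like the one above is normally placed inside a parallel region, with a single thread generating the tasks and the rest of the team executing them. The sketch below shows that arrangement under stated assumptions: f is a hypothetical stand-in for func(), nn = 1001 is an arbitrary odd size chosen to exercise the j+1 <= nn guard, and the explicit firstprivate(j) is added for clarity so that each task captures its own value of j.

```fortran
! Sketch only. f is a hypothetical stand-in for func(); nn is arbitrary.
program task_pairs
  implicit none
  integer, parameter :: nn = 1001
  real :: vals(nn)
  integer :: j

  vals = 0.0

  !$omp parallel
  !$omp single
  ! One thread creates the tasks; any thread in the team may run them.
  do j = 1, nn, 2
     !$omp task firstprivate(j) shared(vals)
     ! Two iterations per task to amortize the per-task overhead.
     vals(j) = f(j)
     if (j + 1 <= nn) vals(j + 1) = f(j + 1)
     !$omp end task
  end do
  !$omp end single
  ! All tasks have completed by the barrier at the end of the single.
  !$omp end parallel

  print *, 'entries filled:', count(vals > 0.0), 'of', nn

contains

  pure real function f(i)
    integer, intent(in) :: i
    f = 0.5 * real(i)
  end function f

end program task_pairs
```

Raising the stride (and the number of statements per task) trades finer load balancing for lower tasking overhead, which is the experiment suggested above.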