Intel® Fortran Compiler
Build applications that can scale for the future with optimized code designed for Intel® Xeon® and compatible processors.

overhead expense for FORALL statement

Izaak_Beekman
New Contributor II
893 Views
I will run some experiments, but does anybody have a sense of whether there's much overhead associated with FORALL statements? Should I compute intermediate variables all in one FORALL block (all functions of the only INTENT(IN) variable, which is needed nowhere else in the procedure), or would it make more sense to compute each intermediate variable right before use in its own FORALL statement?
6 Replies
TimP
Honored Contributor III
894 Views
I'm not certain what comparison you intend to make, so I won't try to guess the result. As discussed previously in these forums (probably under Windows), current ifort supports DO CONCURRENT, which is usually better optimized than FORALL. A FORALL with multiple assignments is equivalent to multiple FORALLs, each with a single assignment, so I guess you could say there is overhead.
In more complex cases, DO loops, possibly with VECTOR ALWAYS or VECTOR ALIGNED directives, will outperform either DO CONCURRENT or FORALL, but I can't relate that to "overhead."
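To make the multiple-assignment point concrete, here is a hypothetical sketch (a, b, c, i, n are placeholder names, not from the code in question):

```fortran
! A FORALL with two assignments ...
FORALL (i = 1:n)
   a(i) = b(i) + 1.0
   c(i) = a(i) * 2.0
END FORALL

! ... must behave as if it were two single-assignment FORALLs:
! the first assignment completes for ALL i before the second starts.
FORALL (i = 1:n) a(i) = b(i) + 1.0
FORALL (i = 1:n) c(i) = a(i) * 2.0
```

That "complete each assignment over the whole index set first" rule is what prevents the compiler from simply fusing the statements into one loop body.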
TimP
Honored Contributor III
894 Views
I'm sorry, your reply disappeared while I was attempting to answer.
DO CONCURRENT is Fortran 2008 syntax, implemented in the ifort XE 2011 releases.
The ifort !DIR$ VECTOR ... directives are described in the HTML documentation, in the docs directory of the compiler installation.
Izaak_Beekman
New Contributor II
894 Views
Wow, that's a bit maddening; where did my post go?

"A FORALL with multiple assignments is the same as multiple FORALL with single assignment, so I guess you could say there is overhead" That's exactly what I meant by overhead, but I don't know what's happening behind the scenes.

Is DO CONCURRENT safe to use with MPI? This code will be parallelized via MPI block domain decomposition. For that matter, are any of the "new" concurrent/asynchronous Fortran features safe to use with OpenMP and MPI? Don't they produce some sort of multithreaded code? If I have one MPI rank per core, will this be an issue?

Finally, here is the small block of code I was talking about earlier. fin is the only INTENT(IN) variable, and its current value is not used again until the next time step. ISk is used multiple times throughout the function; qk and TVk are each used only once: TVk immediately after this code block completes, and qk on the last executable line of the function, to compute the sole INTENT(OUT) variable (i.e., the function return value). The function itself is called at least 6 times at each grid point per time step, and the grid is usually between 5 and 150 million points. If anyone has any wisdom on optimizing this portion, or on the merits of DO vs. FORALL vs. multiple FORALLs (each placed right before its computed variable is needed, rather than all at once), any advice is welcome. (Also, the array dimensions here are small, r < 10, and a4lk and d4lkm are both constant PARAMETERs.)

[Fortran]    FORALL (k = 0:r)
       qk(k)  = SUM(a4lk(:,k) * fin(k-r+1:k))
       ISk(k) = SUM(d4lkm(:,k,1)*fin(k-r+1:k))**2 + SUM(d4lkm(:,k,2)*fin(k-r+1:k))**2 + SUM(d4lkm(:,k,3)*fin(k-r+1:k))**2
# ifndef no_tv_limiter
       ! TV across each candidate stencil
       TVk(k) = SUM(ABS(fin(k-r+2:k) - fin(k-r+1:k-1)))
# endif
    END FORALL[/Fortran]


Steven_L_Intel1
Employee
894 Views
We don't parallelize FORALL. The semantics of FORALL inhibit parallelization. DO CONCURRENT can be parallelized when -parallel is specified - it uses OpenMP under the hood.
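For example, a loop like the following (a sketch; x, y, i, n are placeholder names) is a candidate for auto-parallelization when built with -parallel, because DO CONCURRENT asserts that its iterations are independent:

```fortran
! Built with: ifort -parallel example.f90
! The independence assertion lets the compiler split the
! iterations across OpenMP threads.
DO CONCURRENT (i = 1:n)
   y(i) = 2.0 * x(i) + 1.0
END DO
```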
TimP
Honored Contributor III
894 Views
DO CONCURRENT or FORALL under ifort won't generate multiple threads, unless possibly they do so under the -parallel compile option. You could find out whether that happened with the -par-report options. Those would be OpenMP threads, and (to comply with the MPI standard) you would need MPI_Init_thread() (most likely with the FUNNELED choice) in place of MPI_Init(). Not all MPI implementations make this distinction. You would also need to pay attention to how threads are assigned to cores, and to the number of threads, so as to cooperate with the choices made by your MPI. Intel MPI has automatic thread affinity settings.
Normally, you would not want any threading in the application if you are already running one MPI rank per core, so you would not set -openmp or -parallel in your compilation. Even if you did have -openmp or -parallel threading, you could restrict it to a single thread by setting OMP_NUM_THREADS=1 or by linking the OpenMP stubs library. The latter choice can be made at run time by setting LD_PRELOAD to the stubs library.
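A minimal sketch of the MPI_Init_thread() call mentioned above (a code fragment, with error handling elided):

```fortran
use mpi
integer :: provided, ierr

! Request FUNNELED support: OpenMP threads may exist, but only
! the main thread will make MPI calls.
call MPI_Init_thread(MPI_THREAD_FUNNELED, provided, ierr)
if (provided < MPI_THREAD_FUNNELED) then
   ! The library cannot guarantee even funneled thread support;
   ! fall back to single-threaded execution or abort here.
end if
```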

The new coarray feature, which supports cluster parallelism in the ifort cluster tools edition, is intended as a replacement for MPI, not to cooperate with it. I understand that colleagues whose opinions I value consider such cooperation a requirement. That's the only "new" feature I can think of that conflicts with MPI.

In your code sample, assuming the array sections are long enough (e.g., size > 50) to benefit from auto-vectorization, the first concern is to ensure that each SUM is optimized. At that level, FORALL should not be a concern. I can't guess whether auto-vectorization of SUM would be valuable at such a small length (the compiler might manage it with -xSSE4.1 or higher optimization), but a short loop that sums one element at a time could be very slow.
At the next level, you might want to ensure that the **2 operations aren't repeated. It doesn't appear to me that FORALL should necessarily influence that either, since they all occur within a single assignment. However, if you used a DO loop, you could easily avoid a repeated **2 by storing the intermediate result in a scalar before squaring.
As the array sections aren't big enough to pose cache locality issues, it may not matter whether FORALL inhibits fusion of the various operations (so as to read each array element only once).
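For what it's worth, a DO-loop version of the posted block with scalar intermediates (s1, s2, s3 are hypothetical locals, not in the original code) might look like this; whether it actually beats the FORALL is something only measurement can settle:

```fortran
DO k = 0, r
   ! Hoist the three weighted sums into scalars so each is
   ! computed exactly once before squaring.
   s1 = SUM(d4lkm(:,k,1) * fin(k-r+1:k))
   s2 = SUM(d4lkm(:,k,2) * fin(k-r+1:k))
   s3 = SUM(d4lkm(:,k,3) * fin(k-r+1:k))
   qk(k)  = SUM(a4lk(:,k) * fin(k-r+1:k))
   ISk(k) = s1*s1 + s2*s2 + s3*s3
# ifndef no_tv_limiter
   ! TV across each candidate stencil
   TVk(k) = SUM(ABS(fin(k-r+2:k) - fin(k-r+1:k-1)))
# endif
END DO
```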
Steven_L_Intel1
Employee
894 Views
DO CONCURRENT will generate threads, usually, with -parallel. I have seen it.