Intel® Fortran Compiler

Questions about DO CONCURRENT

Arjen_Markus
Honored Contributor II

I am experimenting a bit with the DO CONCURRENT construct to see if it would improve the performance of one of our programs. Currently I am using Intel Fortran 15, so perhaps the observations I have made are no longer true.

Anyway, here is the basic code I use:

program doconcurrent
    implicit none

    integer, parameter  :: sz = 10000000
    real, dimension(sz) :: array
    integer             :: i, j, chunk, ibgn, iend, tstart, tstop

    call system_clock( tstart )
    do j = 1,10000
        do concurrent (i = 1:sz)
            array(i) = 10.0 * i * j
        enddo
    enddo
    call system_clock( tstop )

    write(*,*) array(1)
    write(*,*) tstop - tstart

end program doconcurrent

It does not do anything useful except exercise the DO CONCURRENT construct. But:

- Compiling it with and without -Qparallel gives roughly the same runtime, about 25 seconds. So no improvement whatsoever.

- I can see that the program runs in 9 threads if I compile it with -Qparallel and with only one thread if I leave out that flag. Also the -Qpar-report flag indicates the loop is parallelized.

- If I insert a write statement to see if the iterations are run in a non-deterministic order, the loop is no longer parallelized.

- My theory was that the runtime is determined by the storing of the new values of the array and that the threads get in each other's way. So instead of this one loop, I used an outer loop that splits the work into large chunks, something like:

    do j = 1,10000
        do concurrent (chunk = 1:8)
            ibgn = 1 + (chunk-1) * (sz+7)/8
            iend = min( chunk * (sz+7)/8, sz )
            do concurrent (i = ibgn:iend)
                array(i) = 10.0 * i * j
            enddo
        enddo
    enddo

But then only the inner loop is parallelized - if I use an ordinary do-loop for the inner one, nothing gets parallelized.

Any comments? An alternative - in this case - would be to use OpenMP, but the drawback of that is that I have to define the "privateness" and "sharedness" of the variables involved myself ;).
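For reference, the OpenMP version of the same benchmark might look like the sketch below. The data-sharing clauses are my guess at what would be needed - and that bookkeeping is exactly what DO CONCURRENT avoids:

```fortran
program openmp_version
    implicit none

    integer, parameter  :: sz = 10000000
    real, dimension(sz) :: array
    integer             :: i, j

    do j = 1,10000
        ! With OpenMP the data-sharing attributes are spelled out by
        ! hand; with DO CONCURRENT the compiler works them out itself.
        !$omp parallel do shared(array, j) private(i)
        do i = 1, sz
            array(i) = 10.0 * i * j
        enddo
        !$omp end parallel do
    enddo

    write(*,*) array(1)
end program openmp_version
```

(Compiled with /Qopenmp instead of /Qparallel.)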

 

TimP
Honored Contributor III

This example offers opportunities for the compiler to recognize duplicate operations, so that there is no gain for parallelization once the available shortcuts have been taken.

The 16.0 compiler did improve optimization of do concurrent, to the point where it is the fastest alternative for one of my non-parallel examples from the netlib vector benchmark.  I don't think it's possible to guess when do concurrent will be the fastest choice, but now there isn't much of a penalty for using it where it is the clearest expression, even if it doesn't bring special optimizations.

Arjen_Markus
Honored Contributor II

Hm, I guess I should experiment with a more useful and elaborate example then :). Pity, that will be more work.

On the other hand, as you say, using DO CONCURRENT does express clearly that no dependencies exist.

FortranFan
Honored Contributor III

See this thread: https://software.intel.com/en-us/forums/intel-visual-fortran-compiler-for-windows/topic/539969

My observations generally have been that in order for DO CONCURRENT to be effective, there needs to be sufficient computational heft; otherwise, the overhead of "setting up" the parallel operations can overwhelm the benefits.  But of course, the other value of DO CONCURRENT can be clarity in coding, as mentioned by Tim Prince.

jimdempseyatthecove
Honored Contributor III

While DO CONCURRENT is documented as potentially being parallelized in conjunction with /Qparallel (auto-parallelization), can someone clarify the behavior in the presence of /Qopenmp? (and in the presence of both options)

1) Is it cognizant of OpenMP?
2) Does it use its own thread team or re-use the OpenMP thread team(s)?
3) If re-use, does it implicitly follow the OpenMP nested parallelization rules (regarding whether nesting is enabled or not)?

Jim Dempsey

TimP
Honored Contributor III

With a suitable example, it should be easy to perform tests to answer Jim's questions.

My example of do concurrent is compiled with /Qopenmp, but doesn't parallelize.  There's no indication in the opt-report of parallelization being considered.  The scalar loop cost of do concurrent is quoted as 40% higher than for a C equivalent, so the reported vector speedup is exaggerated (do concurrent in ifort 16.0 just matches the performance of Intel C).

My expectation is that ifort parallelizes do concurrent only under control of /Qparallel and is subject to OMP_NESTED.  If Jim is asking about the case where /Qparallel or /Qopenmp might parallelize an outer scope containing a do concurrent, we'd have to test it.  I wouldn't be surprised if, when it's in the same compilation unit, that cancels parallelization of the inner do concurrent.

Martyn_C_Intel
Employee

DO CONCURRENT allows the compiler to ignore any potential dependencies between iterations and to execute the loop in parallel. This can mean either SIMD parallelism (vectorization), which is enabled by default, or thread parallelism (auto-parallelization), which is enabled only by /Qparallel. This is independent of /Qopenmp, which does not enable auto-parallelization; it only enables parallelism through OpenMP directives. However, auto-parallelization with /Qparallel uses the same underlying OpenMP runtime library as /Qopenmp. The overhead for setting up and entering a parallel region is typically thousands of clock cycles, so auto-parallelization is usually worthwhile only for loops with a sufficiently large amount of work to amortize this overhead.

Try this simple variant of your loop. The math functions increase the amount of work.

program doconcurrent
    implicit none

    integer, parameter  :: sz = 10000000
    real, dimension(sz) :: array
    integer             :: i, j, chunk, ibgn, iend, tstart, tstop

    call system_clock( tstart )
    do j = 1,100
        do concurrent (i = 1:sz)
            array(i) = log(10.0 * i) + sin(0.003*i)
        enddo
    enddo
    call system_clock( tstop )

    write(*,*) array(1), array(sz)
    write(*,*) tstop - tstart

end program doconcurrent

ifort /Qxavx /Qparallel Doconcurrent.f90

 

The optimization report shows that the inner loop is both vectorized and auto-parallelized (threaded). On my quad core laptop, it ran about twice as fast with /Qparallel. Because of the large array size, (40 MB), I had to increase both the overall program stack size with /F50000000 and the individual thread stacksize with   SET OMP_STACKSIZE=50M  to avoid run-time failures. In the case of DO CONCURRENT with /Qparallel, the compiler is generating a temporary private array, which it does not do for an OpenMP loop.

I don’t think it’s surprising that the compiler does not vectorize or auto-parallelize the loop when you insert a print statement. DO CONCURRENT is rather like using a  !DIR$ IVDEP:LOOP  directive on a DO loop – the compiler can ignore any potential dependencies between iterations, but it will not ignore proven dependencies. OpenMP can thread such a loop, but the result (the order in which the outputs are printed) is undefined and will vary from run to run.
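To illustrate the analogy, the original loop body could be written with the directive instead of DO CONCURRENT, roughly as in the sketch below (assuming the plain !DIR$ IVDEP form of the directive):

```fortran
! Sketch: the directive tells the compiler it may ignore assumed
! (unproven) dependencies in the loop that follows, much as
! DO CONCURRENT does; proven dependencies are still respected.
!DIR$ IVDEP
do i = 1, sz
    array(i) = 10.0 * i * j
enddo
```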

Splitting up a loop into smaller chunks is definitely not the right way to go. As FortranFan also commented, you need plenty of work to make threading worthwhile.

It used to be that if you specified both /Qopenmp and /Qparallel, loops would not be auto-parallelized if any OpenMP loops were present. That no longer seems to be the case. I was able to thread an outer loop with OpenMP and still see an inner loop get auto-parallelized.

I don’t immediately know how the auto-parallel thread teams relate to OpenMP thread teams.

Since the same run-time library is used for both, many of the OpenMP environment variables apply also to auto-parallelism. Like Tim, I would expect, but haven’t tested, that auto-parallelized loops would be subject to OMP_NESTED at run-time.

Arjen_Markus
Honored Contributor II

Thanks for that explanation. The reason I added the write statement was to see if things were indeed parallelised, so I expected to see a random sequence of numbers.

I used the example you gave on my laptop (8 cores shown in the Task Manager window, not sure whether that involves hyperthreading or not).

Without any options, the runtime of the program was 788 seconds. At the time the machine was not doing much else - apart from reading some e-mail and displaying a document, no other computational programs were running.

With /Qparallel alone, I got an improvement of almost a factor of 4: 204 seconds. The advantage was that I did not have to specify a large stack size (/F... or OMP_STACKSIZE). It simply worked.

With /Qparallel /Qxavx, however, I did have to do that, and the runtime was slightly longer: 278 seconds.

I do not know how variable these performance numbers are; I suppose I could let it run a couple of times to get more insight into this - all that requires is patience.

Something I noticed, even with the simple loop I started with, is that the program will fill the entire machine (CPU usage 99%), if compiled with /Qparallel.

The option /Qxavx is not really usable for me (apart from the apparently slightly reduced performance), as I do not know in advance how much memory the actual program will require. It depends on the size of the problem it is supposed to solve, so we use allocatable arrays. /Qparallel alone, however, does hold promise. I will need to circumvent some loop dependencies, but I have an idea of how to do that.
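For illustration, here is a sketch of the test program rewritten with an allocatable array, as our real code does (whether this also sidesteps the compiler's temporary private array on the stack is something I would have to test):

```fortran
program doconcurrent_alloc
    implicit none

    integer           :: sz          ! problem size known only at run time
    real, allocatable :: array(:)
    integer           :: i, j

    sz = 10000000                    ! would normally come from input
    allocate( array(sz) )            ! heap allocation, not a fixed-size array

    do j = 1,100
        do concurrent (i = 1:sz)
            array(i) = log(10.0 * i) + sin(0.003*i)
        enddo
    enddo

    write(*,*) array(1), array(sz)
    deallocate( array )
end program doconcurrent_alloc
```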

 

Arjen_Markus
Honored Contributor II

As was to be expected, the runtimes vary quite a bit. I ran the program in two series of six, once compiled with /Qparallel and once with /Qparallel /Qxavx. Here are the raw numbers:

/Qparallel: 260, 204, 211, 215, 218 and 232 seconds

/Qparallel /Qxavx: 275, 279, 269, 205, 202, 270 seconds

It would seem /Qparallel is slightly faster, but given the variability the evidence is flimsy. It does have the advantage of not requiring any environment variables.

Arjen_Markus
Honored Contributor II

I modified the source for one of the routines in the program I want to optimise and used the /Qpar-report option to see if the one loop I modified was indeed parallelised. Curiously enough, the report does not show anything about that loop. The report contains all manner of detail about array assignments and every other loop, just not this one.

Here is an excerpt from the code with line numbers:

161       rhs      = conc
162       diag     = deriv
163       acodia   = 0.0
164       bcodia   = 0.0
165       disp0q0  = btest( iopt , 0 )
166       disp0bnd = btest( iopt , 1 )
167       loword   = btest( iopt , 2 )
168
169 !         Loop over exchanges to fill the matrices
170
171       nolevel = 4
172       ! DO NOT USE "DO CONCURRENT" FOR THE OUTER LOOP!
173       do level = 1,nolevel
174       do concurrent (iq = 1 : noq)
175
176 !         Initialisations, check for transport anyhow
177
178          ifrom = ipoint(1,iq)
179          ito   = ipoint(2,iq)
180
             ...
268
269       enddo
270       enddo
271
272 !    Now make the solution:  loop over exchanges in the water
273
274       do iq = 1 , noqw
275          ifrom = ipoint(1,iq)
276          ito   = ipoint(2,iq)
277          ...

And here is the relevant part of the report:

Begin optimization report for: DLWQD1

    Report from: Auto-parallelization optimizations [par]


LOOP BEGIN at D:\delft3d-waq-parallel\engines_gpl\waq\packages\waq_kernel\src\waq_kernel\dlwqd1.f(161,7)
<Peeled>
LOOP END

LOOP BEGIN at D:\delft3d-waq-parallel\engines_gpl\waq\packages\waq_kernel\src\waq_kernel\dlwqd1.f(161,7)
   remark #25460: No loop optimizations reported
LOOP END

LOOP BEGIN at D:\delft3d-waq-parallel\engines_gpl\waq\packages\waq_kernel\src\waq_kernel\dlwqd1.f(161,7)
   remark #25460: No loop optimizations reported
LOOP END

LOOP BEGIN at D:\delft3d-waq-parallel\engines_gpl\waq\packages\waq_kernel\src\waq_kernel\dlwqd1.f(161,7)
<Remainder>
LOOP END

LOOP BEGIN at D:\delft3d-waq-parallel\engines_gpl\waq\packages\waq_kernel\src\waq_kernel\dlwqd1.f(162,7)
<Peeled>
LOOP END

LOOP BEGIN at D:\delft3d-waq-parallel\engines_gpl\waq\packages\waq_kernel\src\waq_kernel\dlwqd1.f(162,7)
   remark #25460: No loop optimizations reported
LOOP END

LOOP BEGIN at D:\delft3d-waq-parallel\engines_gpl\waq\packages\waq_kernel\src\waq_kernel\dlwqd1.f(162,7)
   remark #25460: No loop optimizations reported
LOOP END

LOOP BEGIN at D:\delft3d-waq-parallel\engines_gpl\waq\packages\waq_kernel\src\waq_kernel\dlwqd1.f(162,7)
<Remainder>
LOOP END

LOOP BEGIN at D:\delft3d-waq-parallel\engines_gpl\waq\packages\waq_kernel\src\waq_kernel\dlwqd1.f(163,7)
   remark #25460: No loop optimizations reported

   LOOP BEGIN at D:\delft3d-waq-parallel\engines_gpl\waq\packages\waq_kernel\src\waq_kernel\dlwqd1.f(163,7)
      remark #25460: No loop optimizations reported
   LOOP END

   LOOP BEGIN at D:\delft3d-waq-parallel\engines_gpl\waq\packages\waq_kernel\src\waq_kernel\dlwqd1.f(163,7)
   <Remainder>
   LOOP END
LOOP END

LOOP BEGIN at D:\delft3d-waq-parallel\engines_gpl\waq\packages\waq_kernel\src\waq_kernel\dlwqd1.f(164,7)
   remark #25460: No loop optimizations reported

   LOOP BEGIN at D:\delft3d-waq-parallel\engines_gpl\waq\packages\waq_kernel\src\waq_kernel\dlwqd1.f(164,7)
      remark #25460: No loop optimizations reported
   LOOP END

   LOOP BEGIN at D:\delft3d-waq-parallel\engines_gpl\waq\packages\waq_kernel\src\waq_kernel\dlwqd1.f(164,7)
   <Remainder>
   LOOP END
LOOP END

LOOP BEGIN at D:\delft3d-waq-parallel\engines_gpl\waq\packages\waq_kernel\src\waq_kernel\dlwqd1.f(273,7)
   remark #25460: No loop optimizations reported

   LOOP BEGIN at D:\delft3d-waq-parallel\engines_gpl\waq\packages\waq_kernel\src\waq_kernel\dlwqd1.f(277,10)
   <Peeled>
   LOOP END

   LOOP BEGIN at D:\delft3d-waq-parallel\engines_gpl\waq\packages\waq_kernel\src\waq_kernel\dlwqd1.f(277,10)
      remark #25460: No loop optimizations reported
   LOOP END

   LOOP BEGIN at D:\delft3d-waq-parallel\engines_gpl\waq\packages\waq_kernel\src\waq_kernel\dlwqd1.f(277,10)
      remark #25460: No loop optimizations reported
   LOOP END

As you can see, there is no report on the loop that starts at line 173 (and several loops in between) and ends at line 269.

 

TimP
Honored Contributor III

The absence of a report on a loop may indicate that the compiler has optimized it away as redundant.

I don't agree that /Qparallel eliminates the need for environment variables. OMP_NUM_THREADS, OMP_PLACES, etc. play the same roles as under OpenMP, and are particularly important when HyperThreading is enabled. Your varying run times may confirm their importance. If you have a single-CPU symmetric multi-core platform without HT, you may be able to avoid the environment variables.
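For example, a run under /Qparallel might be controlled like this (a sketch; the values and the executable name are illustrative, not recommendations):

```bat
rem Pin one thread per core and limit the team size before running
set OMP_NUM_THREADS=4
set OMP_PLACES=cores
set OMP_STACKSIZE=50M
doconcurrent.exe
```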

Arjen_Markus
Honored Contributor II

Now that would be awkward, as it implements the initialisation of a number of arrays (indices depending on the loop variable) - it is also not entirely trivial.

I removed the "DO CONCURRENT" construct and the outer loop (which does not do anything useful yet), but the loop is still not reported about.

As for the environment variables, that is useful to know. I was merely referring to the need for /F... and OMP_STACKSIZE, as those parameters clearly depend on the problem that is to be solved.

Martyn_C_Intel
Employee

I think you should set the KMP_AFFINITY environment variable, especially if you have hyperthreading enabled, as seems likely. That may make your timings more consistent from run to run. E.g.

Set KMP_AFFINITY=scatter    (or  set OMP_PROC_BIND=spread,  which I think is similar).

It’s strange that you have no message for the loops at 173 and 174. (I take it that 161 – 164 are array assignments). What options are you using? I would use just

/Qopt-report:3 and optionally /qopt-report-file:stderr  

and not use anything like /Qpar-report or /Qvec-report. The optimization reports were completely reimplemented in the 15.0 compiler; these older switches are deprecated and don’t map very cleanly to the new reports. For more about the new reports, see for example  https://software.intel.com/en-us/videos/getting-the-most-out-of-the-intel-compiler-with-new-optimization-reports or the related article at https://software.intel.com/sites/default/files/managed/4c/1c/parallel_mag_issue19.pdf

I agree that the only way you might get no message for a loop should be if it is completely optimized away, e.g. if the quantities it calculates are never used. If it is simply merged with other loops, you should still get a “fusion” message, along with an “empty” loop in the report consisting only of the LOOP BEGIN and LOOP END message.

It’s surprising that your code should run slower and take more memory with /Qxavx. This enables additional instructions and optimizations. Normally, you should use it if your processor supports the corresponding instruction set. Perhaps /Qxavx encourages the compiler to vectorize loops that turn out to have very small trip counts at run time. Taking significantly more stack space is really very unusual.

Incidentally, you can set the OpenMP thread stacksize at runtime, if you choose, by a call to KMP_SET_STACKSIZE_S(). This lets you choose the thread stacksize according to the size of your private arrays. I don’t think there’s a penalty for setting the argument of /F to be very large, if you don’t ever actually use that memory. I tried setting it to 100GB (on my laptop with 4GB) and it didn’t make any difference.
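Such a runtime call might look like the sketch below, assuming Intel's omp_lib module, which declares KMP_SET_STACKSIZE_S and the kind constant kmp_size_t_kind (the headroom figure is just an illustration):

```fortran
subroutine set_thread_stack( nprivate )
    use omp_lib     ! Intel's module declares kmp_set_stacksize_s
    implicit none
    integer, intent(in) :: nprivate   ! number of reals in the private arrays

    integer(kind=kmp_size_t_kind) :: stacksize

    ! Size the thread stack to the private data (4 bytes per real)
    ! plus some headroom; this must be called before the first
    ! parallel region is entered.
    stacksize = 4_kmp_size_t_kind * nprivate + 1000000
    call kmp_set_stacksize_s( stacksize )
end subroutine set_thread_stack
```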

Arjen_Markus
Honored Contributor II

Ah, the use of /Qopt-report:3 did the trick: the loop was not optimised at all. When I used /O3 in addition, the report showed that vectorisation was possible but probably not worth the effort. The report is definitely more detailed than with the (deprecated) /Qpar-report option - I used that one because I could not find the new one (or overlooked it). This should come in handy.

As for controlling affinity, I have not tried that yet, but it is good to know that you can influence the parallel calculation using such environment variables and runtime routines.

 

 

Arjen_Markus
Honored Contributor II

To see whether the busiest loops in my program can be parallelised (I have to take care of a loop dependency, but I have found a solution), I created a small subroutine that exhibits the sort of loops I am dealing with and the solution I have in mind:

! example.f90 --
!     Example of the type of loop and a solution to eliminate the loop dependencies
!
subroutine example( iq, amat )
    implicit none

    integer, dimension(:,:) :: iq
    real, dimension(:,:)    :: amat

    integer                 :: i, first, second

    do i = 1,size(iq,2)
        first  = iq(1,i)
        second = iq(2,i)

        amat(first,second) = -1.0
        amat(second,first) =  1.0
    enddo
end subroutine example

subroutine example_no_deps( iq, amat )
    implicit none

    integer, dimension(:,:) :: iq
    real, dimension(:,:)    :: amat

    integer                 :: i, first, second, level

    do level = 1,4
        do concurrent (i = 1:size(iq,2))
            first  = iq(1,i)
            second = iq(2,i)

            if ( iq(3,i) == level ) then
                amat(first,second) = -1.0
                amat(second,first) =  1.0
            endif
        enddo
    enddo
end subroutine example_no_deps

The first routine cannot be safely parallelised, because the indices first and second can be anything. The second routine solves that by allowing only the set of updates that does not get in each other's way. So far, so good.

The report I get if I compile this source with: -c -Qparallel -Qopt-report:5 is shown below (only example_no_deps shown):

Begin optimization report for: EXAMPLE_NO_DEPS

    Report from: Interprocedural optimizations [ipo]

INLINE REPORT: (EXAMPLE_NO_DEPS) [2/2=100.0%] d:\tmp\example.f90(21,12)


    Report from: Loop nest, Vector & Auto-parallelization optimizations [loop, vec, par]


LOOP BEGIN at d:\tmp\example.f90(30,9)
   remark #17104: loop was not parallelized: existence of parallel dependence
   remark #17106: parallel dependence: assumed OUTPUT dependence between AMAT line 35 and AMAT line 36
   remark #17106: parallel dependence: assumed OUTPUT dependence between AMAT line 36 and AMAT line 35
   remark #15541: outer loop was not auto-vectorized: consider using SIMD directive

   LOOP BEGIN at d:\tmp\example.f90(30,9)
      remark #17109: LOOP WAS AUTO-PARALLELIZED
      remark #17101: parallel loop shared={ .2.8_2upper_.1 } private={ } firstprivate={ .T92_ FIRST SECOND LEVEL I } lastprivate={ } firstlastprivate={ } reduction={ }
      remark #15335: loop was not vectorized: vectorization possible but seems inefficient. Use vector always directive or /Qvec-threshold0 to override
      remark #15460: masked strided loads: 3 
      remark #15463: unmasked indexed (or scatter) stores: 2 
      remark #15475: --- begin vector loop cost summary ---
      remark #15476: scalar loop cost: 27 
      remark #15477: vector loop cost: 82.000 
      remark #15478: estimated potential speedup: 0.320 
      remark #15479: lightweight vector operations: 46 
      remark #15480: medium-overhead vector operations: 3 
      remark #15481: heavy-overhead vector operations: 1 
      remark #15487: type converts: 4 
      remark #15488: --- end vector loop cost summary ---
      remark #25439: unrolled with remainder by 2  
   LOOP END

   LOOP BEGIN at d:\tmp\example.f90(30,9)
   <Remainder>
   LOOP END
LOOP END

LOOP BEGIN at d:\tmp\example.f90(30,9)
   remark #15335: loop was not vectorized: vectorization possible but seems inefficient. Use vector always directive or /Qvec-threshold0 to override
   remark #15460: masked strided loads: 3 
   remark #15463: unmasked indexed (or scatter) stores: 2 
   remark #15475: --- begin vector loop cost summary ---
   remark #15476: scalar loop cost: 27 
   remark #15477: vector loop cost: 82.000 
   remark #15478: estimated potential speedup: 0.320 
   remark #15479: lightweight vector operations: 46 
   remark #15480: medium-overhead vector operations: 3 
   remark #15481: heavy-overhead vector operations: 1 
   remark #15487: type converts: 4 
   remark #15488: --- end vector loop cost summary ---
   remark #25439: unrolled with remainder by 2  
LOOP END

LOOP BEGIN at d:\tmp\example.f90(30,9)
<Remainder>
LOOP END
===========================================================================

The confusing bit is that it says both that the loop at line 30 (do concurrent (i = 1:size(iq,2))) is not parallelised and that it is parallelised.

Can I assume that it is in fact parallelised? (Whether this is useful remains to be seen, because the solution to the loop dependency requires multiple passes over the array iq. I am hopeful, though, because the real loop is fairly long and does a lot of things, so the overhead of selecting only a fraction should be small in comparison.)

TimP
Honored Contributor III

When the report indicates generation of both a parallel and a non-parallel version, there may in fact be two versions, with selection at run time. It might be interesting to find out whether Advisor will show how much time is spent in each; VTune should do so, but it may be necessary to find the time-critical thread in case there are idle ones.

Arjen_Markus
Honored Contributor II

Yes, I had not thought of that one. Thanks. I think I will go and try parallelising the program now ;).
 
