Solved: OpenMP overhead in Intel Fortran

Kevin_McGrattan · 04-28-2022

I recently discovered that the OpenMP directives in my Fortran code add 20% to the CPU time of a single-threaded job, compared to a single-threaded version that is not compiled with OpenMP. I have read forum posts about OpenMP overhead, but I wonder if my slowdown is caused by the fact that the OpenMP directives around my (mainly) big 3D DO-LOOPS partially defeat the compiler's optimization. I am using the latest version of the classic Fortran compiler, with -ipo -O2 optimization. Has anyone else had similar experience, and if so, is there any work-around besides just compiling two executables?

jimdempseyatthecove · 04-28-2022

Please show your !$omp loop structure (indices, nesting order, and indexing usage within the loops).

Also indicate the iteration counts for each loop level.

Jim Dempsey

Kevin_McGrattan · 04-28-2022

   !$OMP DO SCHEDULE(STATIC)
   DO K=1,KBAR
      DO J=0,JBAR
         DO I=1,IBAR
            VS(I,J,K) = V(I,J,K) - DT*( FVY(I,J,K) + RDYN(J)*(H(I,J+1,K)-H(I,J,K)) )
         ENDDO
      ENDDO
   ENDDO
   !$OMP END DO NOWAIT

This is a typical loop, where IBAR, JBAR and KBAR might be anywhere from 20 to 100.

IanH · 04-28-2022

It should be straightforward enough to directly inspect the assembly, generated with OMP on and off, for that stretch of code to see whether the optimiser is doing different stuff.

(When you say "single threaded job" - do you mean that OMP_NUM_THREADS is effectively one, or do you mean the time for a single thread amongst many to execute a particular stretch of code?)

Kevin_McGrattan · 04-28-2022

Yes, I'm learning how to use Advisor at the moment. This is a huge code, so it is difficult to find my way around. I'll figure it out.

By "single-threaded job", I am comparing a job with a single OpenMP thread vs the same job with the code compiled without OpenMP. In other words, the OMP statements are ignored.

IanH · 04-28-2022

I was suggesting that you compile a file containing a suspect loop using /Fa (or whatever its equivalent is on Linux) - with and without OMP, and then compare the resulting assembly that implements the loop for the with and without case.

jimdempseyatthecove · 04-28-2022

At the smallest iteration counts (about 20 per loop level), the innermost statement executes only about 8,000 times. Assuming vectorized code on AVX512 with 8 elements per vector, that is ~1,000 vector iterations (with aligned arrays). AVX/AVX2 would be ~2,000. Therefore, at the lowest iteration count, the parallel region entry/exit overhead would be too high for practical use. At the highest iteration counts, you would have enough iterations to parallelize.
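As a rough check on those counts, here is a minimal sketch that tallies how often the innermost statement of the posted loop runs at the two ends of the stated size range. The program name and the lane counts (8 and 4 elements per vector, which assumes 64-bit reals) are illustrative assumptions, and dividing the total by the lane count is only a ballpark figure, since the compiler vectorizes each inner I loop separately.

```fortran
program trip_count
   implicit none
   ! Assumed lane counts for 64-bit reals: 8 per AVX-512 vector, 4 per AVX/AVX2 vector
   integer, parameter :: lanes_avx512 = 8, lanes_avx2 = 4
   ! Smallest and largest sizes mentioned for IBAR, JBAR and KBAR
   integer, parameter :: n(2) = [20, 100]
   integer :: m, trips

   do m = 1, 2
      ! K=1..N, J=0..N, I=1..N, matching the bounds of the posted loop
      trips = n(m) * (n(m) + 1) * n(m)
      print '(a,i4,a,i8,a,i7,a,i7)', 'N=', n(m), ' scalar=', trips, &
            ' avx512~', trips / lanes_avx512, ' avx2~', trips / lanes_avx2
   end do
end program trip_count
```

With these bounds the statement runs 8,400 times at N=20 and 1,010,000 times at N=100, so the amount of work per loop differs by more than two orders of magnitude across the stated range.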

Note, running an OpenMP loop with 1 thread will still have some thread team startup/end overhead.

Also, your shown code has !$OMP DO... (without PARALLEL). This indicates that your parallel region is at an outer scope level. It would help to show the complete context of the parallel region... .AND. know if your timing is for the 1st iteration of the loop. The first execution of the shown code will (tend to) populate the L1, L2 and L3 cache systems. The 1st pass tends to be significantly slower than the subsequent passes.
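To separate the cold-cache first pass from steady-state timing, something like the following sketch could be used. Everything here apart from the loop body is an assumption for illustration: the array sizes, the number of passes, the placement of the PARALLEL region directly around the loop (the real code's region is not shown in this thread), and the program name. The same source should build with and without -qopenmp, so the two timings can be compared directly.

```fortran
program time_loop
   !$ use omp_lib                       ! compiled only when OpenMP is enabled
   implicit none
   integer, parameter :: dp = kind(1.0d0), i8 = selected_int_kind(18)
   ! Hypothetical grid size and pass count, chosen only for this sketch
   integer, parameter :: ibar = 64, jbar = 64, kbar = 64, npass = 50
   real(dp), allocatable :: v(:,:,:), vs(:,:,:), fvy(:,:,:), h(:,:,:), rdyn(:)
   real(dp) :: dt
   integer :: i, j, k, pass
   integer(i8) :: c0, c1, rate

   allocate(v(ibar,0:jbar,kbar), vs(ibar,0:jbar,kbar), fvy(ibar,0:jbar,kbar))
   allocate(h(ibar,0:jbar+1,kbar), rdyn(0:jbar))
   v = 1.0_dp; fvy = 0.5_dp; h = 2.0_dp; rdyn = 1.0_dp; dt = 0.01_dp
   !$ print *, 'OpenMP threads:', omp_get_max_threads()

   do pass = 0, npass
      ! Pass 0 is an untimed warm-up that populates the caches
      if (pass == 1) call system_clock(c0, rate)
      !$omp parallel default(shared) private(i,j,k)
      !$omp do schedule(static)
      do k = 1, kbar
         do j = 0, jbar
            do i = 1, ibar
               vs(i,j,k) = v(i,j,k) - dt*( fvy(i,j,k) + rdyn(j)*(h(i,j+1,k)-h(i,j,k)) )
            end do
         end do
      end do
      !$omp end do nowait
      !$omp end parallel
      dt = 0.999_dp * dt                ! vary the input so passes are not identical
   end do
   call system_clock(c1)

   print *, 'seconds per timed pass:', real(c1 - c0, dp) / real(rate, dp) / npass
   print *, 'checksum:', sum(vs)        ! keeps the result live for the optimizer
end program time_loop
```

Because this sketch opens and closes a parallel region on every pass, it measures the worst case for region entry/exit overhead. Hoisting the PARALLEL directive outside the pass loop would be closer to the outer-scope layout described above.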

Jim Dempsey

Kevin_McGrattan · 04-29-2022

Thanks for the info, but I'm rapidly reaching my level of incompetence here. We added relatively simple OpenMP directives to our code years ago. We also use MPI. With MPI, we just divide the 3D physical domain into individual grids and farm each out to its own MPI process. This works well, and we can divide our domains into hundreds of sub-grids and get relatively good scaling. For example, I can run a job 100 times faster using 200 sub-grids.

The OpenMP parallelization is not nearly as good. For a single grid, we can get, at best, a factor of 2 speed up, even when using, say, a dozen cores. With a dozen cores, we can just divide the overall grid into 12 sub-grids and get much better scaling. But there are instances where we want to make a simulation on a single grid run faster, and for that we need OpenMP.

We are trying to decide whether to invest more time in the OpenMP calls, or whether to just use MPI to do all parallelization. That is why I asked if anyone finds it unusual that simply removing the OpenMP directives (i.e. not using the -qopenmp option) gives us a 20% speed up by reducing the OpenMP overhead and/or allowing for better optimization of the loops.

jimdempseyatthecove · 04-29-2022

Is your (non-shown) code making calls to MKL?

If it is, which MKL library are you linking to?

Typically when using threaded code (OpenMP), you should link with the sequential MKL library.

Conversely, typically when using sequential code (MPI single thread rank), you should link with the parallel MKL library (provided the rank has multiple cores/threads available).

With combinations of MPI and OpenMP there is a bit more thread tuning that must be done.

Jim Dempsey

Kevin_McGrattan · 04-29-2022

Our code is tens of thousands of lines long. We do link to MKL, but for the timing tests I am doing, we are not using any MKL routines. In any event, why would threaded OpenMP code link to sequential MKL? Wouldn't I want both to use the threaded libs?

jimdempseyatthecove · 04-29-2022

>>why would threaded OpenMP code link to sequential MKL?

Simple example:

System: 4 core, 8 threads

OpenMP using 8 threads

The threaded MKL library uses OpenMP internally. This presents the (somewhat) equivalent of nested parallelism: each host thread that makes an MKL call causes MKL to instantiate (for that thread) a thread pool of 8 threads. IOW 8x8 threads for processing (i.e. oversubscription).
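The 8x8 effect can be mimicked without MKL. The sketch below uses plain nested OpenMP regions, where the inner region is only a stand-in for a threaded library call made from each host thread. It is not how MKL itself is implemented, and the program name and thread counts are illustrative assumptions.

```fortran
program oversub_demo
   use omp_lib
   implicit none
   integer :: total

   total = 0
   call omp_set_max_active_levels(2)    ! permit one level of nested parallelism
   call omp_set_dynamic(.false.)        ! ask the runtime not to shrink the teams

   !$omp parallel num_threads(8)
      ! Each of the 8 host threads opens its own inner team,
      ! standing in for a call into a threaded library
      !$omp parallel num_threads(8)
         !$omp atomic
         total = total + 1
      !$omp end parallel
   !$omp end parallel

   ! Up to 8*8 = 64 threads took part, on a machine with 8 hardware threads
   print *, 'threads that took part:', total
end program oversub_demo
```

The runtime may grant fewer threads than requested, so the printed count is an upper-bound illustration rather than a guaranteed 64.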

Now then, should you ONLY call MKL from the sequential portion of the process .OR. from the master (or single) thread, then (and only then) you may get some performance back by setting KMP_BLOCKTIME=0 and using the threaded MKL library. IOW when exiting a parallel region (and/or reaching a barrier), the main process threads with no work available will immediately suspend instead of spin-waiting for up to the block time (default 200 ms). This will happen both for main code threads as well as MKL threads.

Alternative methods are to limit the threads on both the OpenMP side and the MKL side (as well as proper thread pinning). This is a non-trivial task, and has to be thought out carefully.

Jim Dempsey