Trouble Parallelizing existing code with OpenMP

Colin_W_1 · ‎09-07-2017

OK so this is my 4th attempt to post this in the Windows Fortran Forum. This time I'll omit the attachments, and provide instead the URLs to the web page where one of the previous postings got to...

I have a 65 KLOC codebase that would be rather time-consuming and distracting to post here in its entirety. Instead, firstly, I attach a small example of a test program that represents, approximately, the structure of my actual code. See Source1.f90.

The main program calls DO4 in a loop. In DO4, there are 4 calls to DO1, each with some arguments the same between the calls, and some different. In DO1, there is some entirely artificial code whose purpose is to use up CPU time, in a way that the clever Intel optimizer cannot optimize out of existence. I compile this in 64-bit release mode with Compiler Version 17.0.2.187 Build 20170213 XE2017 Update 2. The run-time with no parallel flags is 141 seconds, on my 4-core win10 PC.

My objective is to arrange that the calls to DO1 occur in separate threads, thus achieving a four-fold speedup. To this end I added some OpenMP directives, see Source2.f90. I compiled this with /Qparallel, /fpp, OpenMP conditional compilation, and /Qopenmp. This produced a reasonably pleasing result: the run time was 53 seconds, a 166% speedup.
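Source2.f90 is not reproduced in this thread, so the following is only a minimal sketch of the structure described above: four independent calls to a CPU-burning routine, each placed in its own OpenMP section. The program name, the arguments, and the body of do1 are hypothetical stand-ins, not the actual test code.

```fortran
! Hypothetical sketch; the real Source2.f90 is not shown in this thread.
program sections_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  real(dp) :: r1, r2, r3, r4
  integer  :: c0, c1, rate

  call system_clock(c0, rate)

  ! Each section is executed once, by one thread of the team.
  !$omp parallel sections default(shared)
  !$omp section
  call do1(1.0_dp, r1)
  !$omp section
  call do1(2.0_dp, r2)
  !$omp section
  call do1(3.0_dp, r3)
  !$omp section
  call do1(4.0_dp, r4)
  !$omp end parallel sections

  call system_clock(c1)
  print *, 'answer is', r1 + r2 + r3 + r4
  print *, 'time', real(c1 - c0, dp) / real(rate, dp)

contains

  ! Artificial work loop (assumed); writes only to its own output argument.
  subroutine do1(seed, res)
    real(dp), intent(in)  :: seed
    real(dp), intent(out) :: res
    integer :: i
    res = 0.0_dp
    do i = 1, 50000000
      res = res + sin(seed + real(i, dp) * 1.0e-7_dp)
    end do
  end subroutine do1

end program sections_sketch
```

Each call writes to a distinct result variable, so the sections share no writable data. Without the OpenMP compiler flag the directives are treated as comments and the same source runs serially, which makes it easy to compare timings.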

Now comes the funny stuff: I removed the !$OMP directives and re-ran, to see the effect of the compiler flags by themselves. Run-time was reduced to 34 seconds, a 314% speedup. ...So clearly, the directives were slowing things down. Hum... I surmised this was probably because the compiler was able to see the entire source in one small file, and so was able to do a better job of optimizing and parallelizing it without me telling it what to do. So, I split the file into 3, putting the main program and subroutines in their own separate files. This made the run-time go back to 141 seconds, despite the parallel flags being specified. And, when I added the directives back in, the run-time came down to 53 seconds again. (The separate files situation is a closer representation of my actual code.)

So: back in my actual code (64 kloc) the routines are all in separate files. There are 4 projects, one main program and 3 dlls. I compiled them all in release mode and ran a representative benchmark, which took 41 seconds. Adding the various parallel compiler flags resulted in a run time of 40 seconds, i.e. no significant difference, just like I saw in the test program split into 3 files. Thus I fully expected the addition of the OpenMP directives would make it go faster. Imagine My Horror when I found the run time was 49 seconds, roughly a 20% slowdown.

I have played around with various incarnations of SHARED and PRIVATE declarations, all to no avail. I have also tried merging the files representing DO4 and DO1 into one module in one file. Clearly there is something preventing the code from parallelizing effectively. I can see the code is using multiple threads, as evidenced by the CPU chart in Windows' Task Manager.

I have attached the 2 source files of interest. Bo_deriv_flash.f90 contains the 4 calls I want to parallelize, starting at line 145. Bo_flash.f90 contains subroutine bo_flash, starting at line 62. These both form part of one DLL. Alas you won't be able to compile them because they depend on a slew of other modules. I only hope that there is sufficient information available by an eyeball scan for someone knowledgeable to give me a hint at how I might get this going faster, please. Obviously I am missing something, please educate me!

All the arguments going into bo_flash are intent(in), apart from the 2nd and 3rd (x_res and ehsn). Bo_flash gets nearly all its data from its arguments; uses about 6 global scalars, and writes to none of them. It calls about 12 contained routines, plus about 6 others in separate files.

For the attachments, see my original posting, which somehow wound up at https://software.intel.com/en-us/node/743349

jimdempseyatthecove · ‎09-08-2017

Your Source2 runtimes on my system:

Core i7 2600K 4C 8T

Serial (OpenMP with Stubs)
  answer is    29690.1037658176
  time    141.912117004395


Parallel with !$OMP PARALLEL (all 8 threads)
  answer is    29690.1037658176
  time    90.0681517124176


Parallel with !$OMP PARALLEL NUM_THREADS(4)
  answer is    29690.1037658176
  time    47.2797043323517

When you know the extent of parallelization, then use the directives to limit the team size for that region. You may also want to specify KMP_AFFINITY=scatter or use OMP_xxx equivalent environment variable settings.
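As a small illustration of limiting the team size for one region, the sketch below requests four threads with NUM_THREADS(4) and prints how many the runtime actually provided. The program name is hypothetical, and it must be compiled with OpenMP enabled because it uses the omp_lib module.

```fortran
! Minimal sketch: request a 4-thread team for a single parallel region.
program team_size_sketch
  use omp_lib                 ! provides omp_get_num_threads
  implicit none

  !$omp parallel num_threads(4)
  !$omp single
  ! NUM_THREADS is a request; the runtime may supply fewer threads.
  print *, 'team size:', omp_get_num_threads()
  !$omp end single
  !$omp end parallel

end program team_size_sketch
```

The clause applies only to the region it is attached to, so other parallel regions keep the default team size. For thread placement, the standard OMP_PROC_BIND and OMP_PLACES environment variables are the likely candidates for the OMP_xxx settings mentioned above, although how closely a given setting matches KMP_AFFINITY=scatter depends on the runtime.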

*** Note: When Fortran procedures are called/invoked from a parallel region, the procedure must also be compiled with OpenMP (or recursive option/declaration or -auto:all). The default (none of the previously mentioned options) is to make local arrays SAVE (the same applies to some user-defined types).
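To illustrate the note, here is a minimal sketch of a procedure with a local scratch array that is called from a parallel region. The module, procedure, and variable names are hypothetical. Declaring the procedure RECURSIVE is one of the ways mentioned above to get a separate copy of the array for each call, even if the file holding the procedure is built without the OpenMP option.

```fortran
module work_mod
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains

  ! RECURSIVE makes the local array automatic (one copy per call)
  ! rather than static storage shared by every thread.
  recursive subroutine fill_and_sum(k, total)
    integer,  intent(in)  :: k
    real(dp), intent(out) :: total
    real(dp) :: scratch(1000)
    integer  :: i
    do i = 1, size(scratch)
      scratch(i) = real(k, dp)
    end do
    total = sum(scratch)
  end subroutine fill_and_sum

end module work_mod

program local_array_sketch
  use work_mod
  implicit none
  real(dp) :: totals(4)
  integer  :: k

  ! k is private to each thread; each iteration writes its own totals(k).
  !$omp parallel do num_threads(4)
  do k = 1, 4
    call fill_and_sum(k, totals(k))
  end do
  !$omp end parallel do

  print *, totals   ! each entry should equal 1000*k
end program local_array_sketch
```

If scratch were static and shared between threads, concurrent calls could overwrite each other's data and the totals would be unreliable, which is the kind of problem the note warns about.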

Jim Dempsey