Extremely poor OpenMP performance in 2019 version of Fortran Compiler

Ioannis_K_ · 09-10-2019

Hello,

I have developed a large program using the 2013 edition of Intel's parallel studio (Professional Edition), with the Visual Fortran Compiler (integrated into Visual Studio 2012).

I recently downloaded the trial of the 2019 Parallel Studio (Cluster edition) with Visual Studio 2017, to see if the new version of the compiler leads to better performance with my CPU model (It is a Xeon E5-2697 v3).

In my benchmarks, I run the same exact code for the same level of optimization (O2), with all default options, apart from using Heap = 0.

To my surprise, the multi-threaded performance of my code when built with the new version of the compiler is abysmal compared to the same exact code built with the 2013 version. I verified that the two compilers give similar performance for single-threaded execution. With 8 threads, the code runs much faster when built with the 2013 edition of the compiler. With 8 threads and the new edition of the compiler, execution is slower than with 1 thread! I am convinced that this is an issue with the Windows-based version of the compiler, as the Linux-based version gives the expected speedup when I use multiple threads.

In light of the above, I wanted to ask whether anyone has an idea what may be causing such slow-down when I use the newer version of the compiler. Could there be a default option/setting in the compiler or in Visual Studio 2017 which was not present in the old versions?

Thanks in advance for any suggestions and advice.

j0e · 09-13-2019

From my own extremely limited experience with OpenMP, a slowdown in going from a single thread to multiple threads usually indicates that I have a subtle data race (and I get different behavior on Windows vs Linux with the same code, both compiled with Intel). Have you tried running Intel Inspector on the code, set to look for data races?

Ioannis_K_ · 09-13-2019

What I cannot understand is why THE SAME EXACT code would have VASTLY DIFFERENT BEHAVIOR (with respect to OpenMP) when compiled with different versions of the Intel compiler, and why the NEWEST version of the Intel compiler is tremendously slower (for any number of threads other than 1) than the 2013 version of the same compiler. I cannot understand how the NEW VERSION of the compiler would "encounter a data race" while the OLD VERSION of the same compiler would not.

jimdempseyatthecove · 09-15-2019

Can you verify that the Intel OpenMP runtime library DLL that loads with your program comes from the same compiler version that built your program (check the paths)?

Jim Dempsey

jimdempseyatthecove · 09-15-2019

And, does the Windows Task Manager (Performance Tab) show all threads running?

What are your environment variables (as used by Intel OpenMP)? These will be KMP_... and OMP_...
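A minimal sketch for collecting those settings, assuming Python is available on the machine. The helper name `openmp_env` is made up for illustration. It only reports what the current process environment contains, so anything set in the Visual Studio debugging properties would have to be checked there.

```python
import os

def openmp_env(environ=None):
    """Return the KMP_* and OMP_* variables from an environment mapping."""
    env = os.environ if environ is None else environ
    prefixes = ("KMP_", "OMP_")
    # Windows environment names are case-insensitive, so compare in upper case.
    return {k: v for k, v in env.items() if k.upper().startswith(prefixes)}

if __name__ == "__main__":
    settings = openmp_env()
    if not settings:
        print("No KMP_* or OMP_* variables are set; runtime defaults apply.")
    for name in sorted(settings):
        print(f"{name}={settings[name]}")
```

Running it from the same command prompt (or the same launch configuration) as the benchmark matters, since each process inherits its own copy of the environment.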

Jim Dempsey

Ioannis_K_ · 09-15-2019

Jim, thank you for bringing this up. I myself suspect that some kind of incompatibility may be causing this (especially since Linux builds using the new editions of the Fortran compiler do not exhibit the problem I am seeing with Windows builds). I had previously installed the 2013 version of the compiler, before uninstalling it and installing the 2019 version. Similarly, I had Visual Studio 2012, which I have now replaced with Visual Studio 2017. I am not sure whether the previous installation led to an incompatibility.

If possible, can you please provide me with steps to check the paths that you mentioned? Also, if it turns out that I do have an incompatibility in the OpenMP library, can you please explain how I would be able to resolve it?

Yannis

jimdempseyatthecove · 09-15-2019

The Intel OpenMP runtime library on Windows is libiomp5md.dll. I see you are using Microsoft Visual Studio 2012. The installation of Intel Parallel Studio (any version) should properly set up the runtime environment variables for running within the MS VS IDE, and it should also set up the environment for running your program outside the MS VS IDE. However, there are situations where competing factors determine which DLL gets loaded, in particular if you happen to have run the Intel Redistributables installer (intended for running Intel-built programs on systems without an Intel Parallel Studio installation).

One of the better ways to identify which runtime library did load with the application is http://www.dependencywalker.com/

Now, I must state that I have not used this tool. The screenshot on the above page does not show full path to the dll, but I suspect the full path is obtainable.

If you find an incorrect path (not that of the Intel PS Version 19), then your system environment variable PATH may be incorrectly set up.
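As a rough cross-check on PATH itself, the sketch below lists every PATH directory that contains a given DLL, in search order. It assumes Python is installed and that the runtime DLL is named libiomp5md.dll, which is the usual name on Windows. The function name `find_on_path` is hypothetical. Windows also looks in the application folder and the system folders before PATH, so this shows only part of the picture.

```python
import os

def find_on_path(filename, path=None):
    """List every copy of `filename` found in the PATH directories, in order."""
    raw = os.environ.get("PATH", "") if path is None else path
    hits = []
    for directory in raw.split(os.pathsep):
        directory = directory.strip('"')  # PATH entries are sometimes quoted
        candidate = os.path.join(directory, filename)
        if directory and os.path.isfile(candidate):
            hits.append(candidate)
    return hits

if __name__ == "__main__":
    # Assumption: libiomp5md.dll is the DLL name used by the installed runtime.
    for hit in find_on_path("libiomp5md.dll"):
        print(hit)
```

If more than one copy is printed, the first one listed is the one a PATH lookup would reach first.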

You can Google how to edit the system PATH variable for the version of Windows you have.

Note: Microsoft's system environment variable editor is not friendly for long paths. If you are not comfortable with it, you may be able to locate a utility online (be careful with downloads).

Jim Dempsey

jimdempseyatthecove · 09-15-2019

See: https://www.nextofwindows.com/three-alternative-windows-environment-path-editor

Jim Dempsey

Ioannis_K_ · 09-15-2019

Jim, thank you for all this information.

I have checked Windows task manager, and the computer does indeed use all the threads that I am requesting.

I will look into the issue you mentioned with the runtime environment variables. I will also try to run the two versions of the program in VTune Amplifier, to check whether there is any major discrepancy in where the bulk of the execution time is spent. I will then get back to you with any findings which may shed light on the cause of this discrepancy.

Yannis

jimdempseyatthecove · 09-16-2019

>> I will also try to run two versions of the program in VTune Amplifier, to check whether there is any major discrepancy on where the bulk of the execution time is spent

Good strategy

Jim Dempsey

Ioannis_K_ · 09-17-2019

Jim, I took some time to further examine the situation. Here is the new information.

First, the problem is NOT due to an incompatibility with runtime libraries. I just received a new computer, so I had a fresh install of Visual Studio 2017 and one of the latest versions of the Intel Fortran compiler (2019 - update 4). The same problem occurs.

Second, I was able to profile performance with VTune Amplifier, and I detected major discrepancies between the parts of the code which consume the bulk of the computation time. I need to emphasize that the runs with the code built using the 2013 edition were done on a PC having dual Xeon E5-2690 v2 processors, while the runs with the code built using the 2019 edition were done on a PC having dual Xeon E5-2697 v3 processors. I have verified that code built with the 2013 edition runs at very similar speed on both computers. The slowdown is obtained when I use the newer compiler on either machine.

I attach a pdf file, showing snapshots of the "bottom-up" view of the report produced by VTune Amplifier. Each version of the compiler was profiled twice, one time using a single thread, a second time using 36 threads.

The 2013 compiler gives results which were, to some extent, expected. They may give me indications on how to improve some aspects of the code.

On the other hand, the 2019 compiler gives worse performance for a single thread than the 2013 version, and a shockingly bad performance for 36 threads. You may be able to notice that the routines causing most of the trouble have not been created by me, and they are associated with "secondary operations" (such as memory allocation?).

Based on the above, do you have any insight on why the 2013 compiler leads to much faster execution than the 2019 compiler? Furthermore, is there any option in the 2019 compiler that would resolve the serious slowdown obtained for multi-thread execution?

Once again, thank you for all your help on this.

jimdempseyatthecove · 09-18-2019

The VTune charts point to the culprit.

The 2013 shows STRSUPDATE, BEAM3DEB, NtDelayExecution, for_cpstr, intel_fastmemcmp, ...
The 2019 shows for__acquire_semaphore_threaded, for_deallocate, for_dealloc_allocatable, SleepEx, for__acquiresemaphore_threaded, for_allocate, kmp_fork_barrier, for_deallocate, __kmp_join_barrier, then finally some work BEAM3DEB

Either: a) your memory allocations are performed differently (e.g. one build is using heap arrays for local arrays and the other is not), or b) the heap allocator of one differs significantly from that of the other. Also, SleepEx may be called more often depending on KMP_BLOCKTIME (or the OpenMP equivalent).
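One way to test the KMP_BLOCKTIME part of this is to time the same executable under different settings. The sketch below is only an illustration under assumptions: Python is available, the helper name `timed_run` is hypothetical, and the program to time is passed on the command line. The values 0 and 200 are example settings (200 ms has been the usual default for the Intel runtime).

```python
import os
import subprocess
import sys
import time

def timed_run(command, overrides):
    """Run `command` with extra environment variables; return (seconds, exit code)."""
    env = dict(os.environ)
    env.update(overrides)  # overrides apply to the child process only
    start = time.perf_counter()
    result = subprocess.run(command, env=env)
    return time.perf_counter() - start, result.returncode

if __name__ == "__main__":
    # Hypothetical usage: python blocktime.py my_program.exe [args...]
    command = sys.argv[1:]
    if command:
        for blocktime in ("0", "200"):
            seconds, code = timed_run(command, {"KMP_BLOCKTIME": blocktime})
            print(f"KMP_BLOCKTIME={blocktime}: {seconds:.2f} s (exit code {code})")
```

The same helper can be reused with `{"OMP_NUM_THREADS": "1"}` and `{"OMP_NUM_THREADS": "8"}` to compare thread counts without editing system settings.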

As for different heap managers: note that scalable_free (and, I assume, scalable_allocate) are located within libiomp5md.dll. In earlier versions this code was derived from the TBB scalable allocator (came in with OpenMP V4 with OMP TASKing). See if you can get the application to load the libiomp5md.dll from the 2013 version. I am not certain whether you will also need to link in the 2013 versions of some TBB DLL(s). Hopefully, the exported symbols are the same between the versions.

Jim Dempsey

jimdempseyatthecove · 09-18-2019

Note, an experimental hack would be to copy the DLL(s) to the same location as the .EXE.
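This works because Windows normally searches the folder containing the .EXE before it searches PATH. A minimal sketch of the copy step, assuming Python is available; the function name `copy_dlls_beside_exe` and the paths in the comment are hypothetical placeholders, not actual install locations.

```python
import os
import shutil

def copy_dlls_beside_exe(dll_paths, exe_path):
    """Copy each DLL into the folder that contains the executable."""
    target_dir = os.path.dirname(os.path.abspath(exe_path))
    copied = []
    for dll in dll_paths:
        # copy2 keeps timestamps, which helps tell the two versions apart later.
        copied.append(shutil.copy2(dll, target_dir))
    return copied

# Hypothetical usage (placeholder paths):
# copy_dlls_beside_exe([r"D:\old_redist\libiomp5md.dll"], r"D:\build\Release\app.exe")
```

Deleting the copied files afterwards restores the normal lookup, so the experiment is easy to undo.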

Jim Dempsey