OpenMP threading performance

John_Young · 05-01-2019

Hi,

I don't know if this is the proper forum for OpenMP questions, but we are observing this phenomenon in one of our Fortran programs. If there is a better forum for this question, please let me know. We have been unable to replicate it with a small test case, so the best I can do is describe it.

Our development environment is Visual Studio 2017 with Intel Fortran 2019 and Intel MKL 2019. We have confirmed the behavior under 64-bit Windows 7 and Windows 10.

When we compile our code statically (I'm not 100% sure of the terminology here) by setting "Use MFC in a static library" and the Fortran runtime library to "Multithreaded", we see that our code has very good threading performance in OpenMP regions, although the memory usage can be quite high if the number of threads is large. Furthermore, for the static build, deallocating arrays and freeing MKL buffers (mkl_free_buffers) rarely shows much effect when observing the memory usage in the Windows task manager.

On the other hand, when we compile our code dynamically (not 100% sure of the terminology here) by setting "Use MFC in a shared library" and the Fortran runtime library to "Multithreaded DLL", we see that our code can have very poor threading performance in OpenMP regions, but the memory usage seems much better. In particular, deallocating arrays and freeing MKL buffers seems to show quite a large memory drop when observing the Windows Task Manager. The OpenMP behavior is particularly troublesome. In my investigation of it, what seems to be happening is that all the threads are created in the OpenMP threaded region, but each thread runs sequentially. For example, if I write out the thread number in the OpenMP loop, I see all the different threads reporting, but if I observe the Windows Task Manager, it only seems like one thread is actually active at a time.

The threading efficiency behavior we see in the dynamic build seems to be related to the problem size and number of threads. For small problems or small numbers of threads, the threading seems to work properly. As the problem size increases and/or number of threads increases, some threshold seems to be reached where the threads seem to execute sequentially even though all the threads are created in a given region. For a given problem size, we might observe perfectly fine threading behavior for 8 threads but start to see the poor behavior for 16 threads. For larger problems, even 8 threads may exhibit the behavior. In addition, our program has many different threaded regions, and only some of the regions may exhibit the behavior in a particular simulation.

We have been struggling with this issue for about six months, so if anyone from Intel can offer any comments or advice, we would be very appreciative. We would love to run a 'dynamic' build to gain the better memory performance, but losing the OpenMP efficiency for larger problems is problematic for large and long simulations.

Thanks,

John

jimdempseyatthecove · 05-10-2019

>> for the static build, deallocating arrays and freeing MKL buffers (mkl_free_buffers) rarely shows much effect when observing the memory usage in the Windows task manager

It won't.

Your process (program) uses virtual memory. The heap (be it the CRTL heap or the TBB scalable allocator heap(s) as used by OpenMP) only acquires resources as pages are touched. When memory is freed, its pages are returned to the heap, yet they remain touched by the process and remain available to the process for subsequent use. IOW, reusing them will not incur the additional overhead of page faulting to the O/S to commit the virtual memory page to RAM and/or the page file. Thus, the Task Manager will not see reclamation of freed memory. Note, Windows has a separate (from the standard heap) allocation/deallocation API that releases not only physical RAM (when allocated) but also page file space (when allocated). The standard heap (as well as the TBB scalable allocator) hangs on to allocated page file space until process end (or termination of the thread pool).

>> but each thread runs sequentially. For example, if I write out the thread number in the OpenMP loop, I see all the different threads reporting, but if I observe the Windows Task Manager, it only seems like one thread is actually active at a time.

We would have to see some code, and we would need to know which MKL library you are linking with as well as your environment variables (both OpenMP and MKL).

You should note that generally (with some exceptions) when you have an OpenMP application using MKL, you link the serial version of MKL. Conversely, when you have a serial application, you link with the parallel version of MKL. IOW only one of the domains (application .OR. MKL) is designated to be parallel. The reason for this, in an OpenMP application, is to keep MKL from creating its own OpenMP thread pool for each of the application's OpenMP threads that call into MKL, e.g. having oversubscription of N * N threads when you expected to use only N threads.
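The N * N figure is just a product, which a few lines of C can spell out. This is only the worst-case arithmetic under the assumption stated above (every application thread that calls into the library gets its own team); the function name worst_case_threads is made up for illustration and nothing here queries MKL or OpenMP.

```c
/* Worst-case thread count when each of `app_threads` application threads
   calls into a library that runs `lib_threads_per_caller` threads per caller.
   Pass 1 for the second argument to model a serial (sequential) library. */
long worst_case_threads(long app_threads, long lib_threads_per_caller)
{
    return app_threads * lib_threads_per_caller;
}
```

For example, worst_case_threads(8, 8) gives 64 and worst_case_threads(16, 16) gives 256, while worst_case_threads(16, 1) stays at 16, which is the point of linking the serial MKL. If relinking is inconvenient, setting the MKL_NUM_THREADS environment variable to 1 may be a way to test the same idea, though whether it helps in this case is untested.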

>>As the problem size increases and/or number of threads increases, some threshold seems to be reached where the threads seem to execute sequentially even though all the threads are created in a given region. For a given problem size, we might observe perfectly fine threading behavior for 8 threads but start to see the poor behavior for 16 threads.

This may be related to the MKL library chosen, or possibly to some other external function/subroutine you are calling that requires serialization (random number generator, heap allocation/deallocation, OpenMP critical section, mutex, atomic, etc.), and/or to loop iterations too small to be used efficiently by larger thread counts.
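As an illustration of the serialization case, the following C sketch shows how a single critical section can make a parallel loop behave as if the threads take turns. It is not taken from the program under discussion; work() is a hypothetical stand-in for the real loop body, and the code assumes the compiler's OpenMP option is enabled (without it the pragmas are ignored and both functions simply run serially).

```c
/* Stand-in for the real per-iteration work (hypothetical). */
static double work(long i)
{
    return (double)i * 0.5;
}

/* The whole loop body sits inside one critical section, so only one thread
   can execute it at a time: every thread shows up, but they take turns. */
double sum_serialized(long n)
{
    double total = 0.0;
    long i;
    #pragma omp parallel for
    for (i = 0; i < n; ++i) {
        #pragma omp critical
        {
            total += work(i);
        }
    }
    return total;
}

/* Same result, but work() runs outside any lock and the per-thread
   partial sums are combined by a reduction at the end. */
double sum_parallel(long n)
{
    double total = 0.0;
    long i;
    #pragma omp parallel for reduction(+:total)
    for (i = 0; i < n; ++i) {
        total += work(i);
    }
    return total;
}
```

Both functions return the same sum, but in the first one adding threads mostly adds waiting. A heap allocation or a non-reentrant library call inside the loop body can have the same effect as the explicit critical section, because the lock is then taken inside the callee where it is harder to see.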

That the choice of static library versus DLL affects performance is a bit confusing. For each of your builds, try running a Debug build, insert a breakpoint deep inside your code (after MKL calls), then at the breakpoint, look at the total number of threads in use by your process. This is an easy way to see if you have oversubscription due to the wrong MKL DLL being loaded.

>>Use MFC in a shared library

This would seem to imply that your process is mixed language. This may introduce additional (unexpected) threading complications.

If (as an example) your main code is written in C#, the language tends to have you continually instantiate short-lived doWork threads. Be aware that each C# doWork thread (or C++ std::thread) that calls into your Fortran code, which itself contains OpenMP regions, makes the call with a different thread handle, and each such thread will instantiate a new OpenMP thread pool for its own use.

Jim Dempsey