Intel® Fortran Compiler

Poor performance when using OpenMP + VTune report

Ioannis_K_

Hello,

I am running a large code that I wrote, and I do not get the expected speedup when using OpenMP (for example, 16 threads run no faster than 4 threads).

I used the VTune Amplifier to profile my code, and the results indicate that an unreasonably large amount of time is spent in functions BaseThreadInitThunk, RtlUserThreadStart and [OpenMP worker] (I am not sure if the last one is an actual function - the VTune report lists it as such).

I simply wanted to ask whether it is possible to know what tasks these functions perform, and whether they could indicate what I need to change in my code to get the most benefit from multi-threaded execution.

Any input/suggestions on this will be greatly appreciated.

jimdempseyatthecove

>> the results indicate that an unreasonably large amount of time is spent in functions BaseThreadInitThunk, RtlUserThreadStart

Those are the initial entry points when you create a thread on Windows (the root of the thread's call stack), so their time includes everything the thread goes on to call.

In other words, by the same measure you will also see an unreasonably large amount of time spent in the main PROGRAM.

I suggest you set the VTune view to Bottom-Up as opposed to Top-Down.

Jim Dempsey

Ioannis_K_

Jim, thank you for your reply.

I have attached a screenshot of the Bottom-Up view. It seems to me that there is an unreasonably large amount of time spent on thread initialization (?).

I must clarify that the program includes a particular multi-threaded region, which is executed many times. The multi-threaded region essentially uses two routines (these are the routines Q4 and ASSEMBLY1, which are also listed in my VTune report - btw the amount of CPU time spent on these is "reasonable"). 

At this point, my question is: is there a way to know whether I can change something in my code to avoid spending that much time on thread initialization? For example, is there a way to combine Intel Advisor with VTune Amplifier? I would imagine that, if there is a blatant issue in my code preventing speedup, these automated tools would be able to point to its cause...

Thank you, and I apologize if my questions are too basic for the forum...

jimdempseyatthecove

The OpenMP worker entry in the VTune report may be a red herring. By this I mean the data in the report may not be meaningful.

What may provide additional insight is to use the Threads pane of VTune to look at the CPU time of each individual thread and see whether the run times are balanced. Be aware that for__write_output (at least when writing to the same I/O unit) will (or should) be protected by a critical section, which serializes the threads at that point. There may be other issues as well if all threads issue the WRITE and the code makes no provision for that.
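One way to keep WRITE out of the parallel region is to store each result in a shared array and write everything from a single thread afterwards. The following is only a minimal sketch of that pattern, not the poster's code: the module name deferred_output, the routines element_result and compute_then_write, and the argument out_unit are all hypothetical, and element_result is a placeholder for whatever the real routines compute.

```fortran
module deferred_output
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains

  ! Hypothetical stand-in for the real per-iteration work.
  pure function element_result(i) result(r)
    integer, intent(in) :: i
    real(dp) :: r
    r = real(i, dp)**2
  end function element_result

  ! Compute all results in parallel, then write them from one thread.
  subroutine compute_then_write(n, out_unit, results)
    integer, intent(in)   :: n, out_unit
    real(dp), intent(out) :: results(n)
    integer :: i

    ! No WRITE inside the parallel loop: each iteration only fills
    ! its own slot of the shared array, so no thread waits on I/O.
    !$omp parallel do default(none) shared(n, results) private(i)
    do i = 1, n
      results(i) = element_result(i)
    end do
    !$omp end parallel do

    ! Output happens once, outside the parallel region.
    do i = 1, n
      write(out_unit, '(i8, 1x, es16.8)') i, results(i)
    end do
  end subroutine compute_then_write

end module deferred_output
```

A caller would pass an already opened unit, for example call compute_then_write(n, my_unit, results). Whether this helps depends on how much of the run time the WRITE statements actually account for, which the per-thread view in VTune should show.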

*** I notice that you are using MKL. Read the following carefully.

Programmers familiar with C/C++ programming have learned that a multi-threaded program is supposed to (required to) link with a multi-threaded library.

With respect to MKL, the application of the C/C++ term "multi-threaded" is a misnomer. What really is meant is the library must be thread-safe (multi-thread-safe).

MKL terminology:

Multi-threaded: The application has but a single thread, and the MKL library itself will use multiple threads (OpenMP).
Single-threaded: The application may be single-threaded or multi-threaded, and for each caller MKL will use the caller's thread.

*** Should you link the MKL multi-threaded library with your 16-thread multi-threaded application, each of those 16 threads will ask MKL to instantiate a thread pool of 16 threads (16 * 16 = 256 threads).
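To make the multiplication concrete, here is a minimal sketch that uses plain nested OpenMP regions as a stand-in for a threaded library call. It does not call MKL. The module nested_team_count and the function count_nested_threads are hypothetical names, and the inner parallel region only imitates a library that starts its own team from every calling thread.

```fortran
module nested_team_count
  !$ use omp_lib
  implicit none
contains

  ! Count how many threads actually run the innermost region when
  ! each of "outer" application threads starts a team of "inner".
  function count_nested_threads(outer, inner) result(total)
    integer, intent(in) :: outer, inner
    integer :: total
    integer :: n

    n = 0
    ! Allow two active levels so the inner region is not serialized
    ! (this line is compiled only when OpenMP is enabled).
    !$ call omp_set_max_active_levels(2)

    !$omp parallel num_threads(outer) default(none) shared(n, inner)
      ! Stand-in for a threaded library call made by every thread.
      !$omp parallel num_threads(inner) default(none) shared(n)
        !$omp atomic
        n = n + 1
      !$omp end parallel
    !$omp end parallel

    total = n
  end function count_nested_threads

end module nested_team_count
```

Built with OpenMP enabled, count_nested_threads(16, 16) can report up to 256, although the runtime is free to grant fewer threads than requested. Built without OpenMP, the directives are ignored and the function returns 1. If the same thing is happening with MKL, the usual things to try are linking the single-threaded (sequential) MKL library or limiting MKL to one thread per caller, for example with call mkl_set_num_threads(1) before the parallel region.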

Jim Dempsey
