Intel® Fortran Compiler

Poor performance when using OpenMP + VTune report

Ioannis_K_
Novice

Hello,

I am running a large code that I have written, and I have noticed that I do not get the desired speedup when using OpenMP (for example, a run with 16 threads is no faster than a run with 4 threads).

I used the VTune Amplifier to profile my code, and the results indicate that an unreasonably large amount of time is spent in functions BaseThreadInitThunk, RtlUserThreadStart and [OpenMP worker] (I am not sure if the last one is an actual function - the VTune report lists it as such).

I simply wanted to ask whether it is possible to know what tasks these specific functions perform, and whether they could indicate what I need to change in my code to get the most benefit from multi-threaded execution.

Any input/suggestions on this will be greatly appreciated.

3 Replies
jimdempseyatthecove
Black Belt

>> the results indicate that an unreasonably large amount of time is spent in functions BaseThreadInitThunk, RtlUserThreadStart

Those are the initial entry points when you create a thread (the top of the thread's call stack).

In other words, a Top-Down view attributes to each entry the time of everything called beneath it, so by the same logic you will also see an unreasonably large amount of time spent in the main PROGRAM.

I suggest you set the VTune view to Bottom-Up as opposed to Top-Down.

Jim Dempsey

Ioannis_K_
Novice

Jim, thank you for your reply.

I attach a photo from the Bottom-Up view. It seems to me that there is an unreasonably large amount of time spent on thread initialization (?).

I must clarify that the program includes a particular multi-threaded region, which is executed many times. The multi-threaded region essentially uses two routines (these are the routines Q4 and ASSEMBLY1, which are also listed in my VTune report - btw the amount of CPU time spent on these is "reasonable"). 

At this point, my question is: is there a way to know whether there is something I can change in my code to avoid spending that much time on thread initialization? For example, is there a way to combine Intel Advisor with VTune Amplifier? I would imagine that, if there is a blatant issue in my code preventing speedup, these automated tools would be able to point to the cause of this...

Thank you, and I apologize if my questions are too basic for the forum...

jimdempseyatthecove
Black Belt

The OpenMP worker entry in the VTune report may be a red herring. By this I mean the data in the report may not be meaningful.

What may provide additional information/insight is to use the Threads pane of VTune to look at the individual thread CPU Times and see whether the runtimes are balanced. You should be aware that for__write_output (at least when writing to the same I/O unit) will (should) contain a critical section, which serializes the threads at that point. There may or may not be other issues if all threads are issuing the WRITE and the code was not written with that in mind.

*** I notice that you are using MKL. Read the following carefully.
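To illustrate the WRITE point, here is a minimal sketch, not taken from the original code. It assumes each loop iteration produces one value that would otherwise be written from inside the parallel loop. The program name, the array `results`, and the `sqrt` stand-in for the real work are all hypothetical. The idea is to store results in memory during the parallel loop and let a single thread do the output afterwards, so no thread waits on the I/O lock.

```fortran
program buffered_write
   implicit none
   integer, parameter :: n = 8          ! hypothetical problem size
   double precision   :: results(n)
   integer            :: i

   ! Each iteration stores its result in its own array element,
   ! so no WRITE (and no I/O critical section) occurs inside the loop.
   !$omp parallel do
   do i = 1, n
      results(i) = sqrt(dble(i))        ! stand-in for the real per-element work
   end do
   !$omp end parallel do

   ! A single thread performs all output after the parallel loop.
   do i = 1, n
      write(*, '(i4, f12.6)') i, results(i)
   end do
end program buffered_write
```

If the output really must happen inside the parallel region, this approach does not apply as written; the Threads pane should still show whether the threads are waiting on each other at the WRITE.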

Programmers familiar with C/C++ programming have learned that a multi-threaded program is supposed to (required to) link with a multi-threaded library.

With respect to MKL, the application of the C/C++ term "multi-threaded" is a misnomer. What really is meant is the library must be thread-safe (multi-thread-safe).

MKL terminology:

Multi-threaded: The application has but a single thread, and the MKL library itself will use multiple threads (OpenMP).
Single-threaded: The application may be single-threaded or multi-threaded, and for each caller MKL will use the caller's thread.

*** Should you link the MKL multi-threaded library together with your 16-thread multi-threaded application, each of those 16 threads will ask MKL to instantiate its own thread pool of 16 threads (16 * 16 = 256 threads).
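The arithmetic above can be sketched as follows. This is only an illustration under the worst-case assumption that every application thread gets an MKL pool as large as the application's own team. The program and variable names are hypothetical. The commented-out call to `mkl_set_num_threads` is one possible remedy and needs MKL at link time; the other is to link the sequential MKL library instead (with the Intel compiler on Windows that is, as far as I recall, the `/Qmkl:sequential` option; check the compiler documentation for your version).

```fortran
program mkl_thread_count
   use omp_lib
   implicit none
   integer :: outer, inner_pool

   outer      = omp_get_max_threads()   ! threads in the application's own team
   inner_pool = outer                   ! assumed size of each MKL pool (worst case)

   write(*, '(a, i0)') 'application threads:            ', outer
   write(*, '(a, i0)') 'possible threads with MKL pools: ', outer * inner_pool

   ! Possible remedy when MKL is called from inside a parallel region
   ! (requires linking MKL, so it is left commented out here):
   !   call mkl_set_num_threads(1)
end program mkl_thread_count
```

Compile with OpenMP enabled (for example `/Qopenmp` with Intel Fortran on Windows). With 16 application threads the second line reports 256, matching the count above.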

Jim Dempsey
