I am running a large code I have created, and have noticed that I do not get the desired speed up when using openMP (for example, 16 threads is as fast as 4 threads).
I used the VTune Amplifier to profile my code, and the results indicate that an unreasonably large amount of time is spent in functions BaseThreadInitThunk, RtlUserThreadStart and [OpenMP worker] (I am not sure if the last one is an actual function - the VTune report lists it as such).
I simply wanted to ask whether it is possible to know what tasks are performed by the specific functions, and whether these functions could indicate what I may need to change to my code to maximize the benefit of multi-threaded execution.
Any input/suggestions on this will be greatly appreciated.
>> the results indicate that an unreasonably large amount of time is spent in functions BaseThreadInitThunk, RtlUserThreadStart
Those are the initial entry points when you create a thread (top of the thread's call stack)
IOW, you will also see an unreasonably large amount of time spent in the main PROGRAM.
I suggest you set the VTune view to Bottom-Up as opposed to Top-Down
Jim, thank you for your reply.
I attach a photo from the Bottom-Up view. It seems to me that there is an unreasonably large amount of time spent on thread initialization (?).
I must clarify that the program includes a particular multi-threaded region, which is executed many times. The multi-threaded region essentially uses two routines (these are the routines Q4 and ASSEMBLY1, which are also listed in my VTune report - btw the amount of CPU time spent on these is "reasonable").
At this point, my question is: is there a way to know if there is something I can change in my code, to avoid spending that much time on thread initialization? For example, is there a way to combine the Intel Advisor with VTune amplifier? I would imagine that, if there is a blatant issue in my code preventing speedup, these automated tools would be able to point to the cause of this...
Thank you, and I apologize if my questions are too basic for the forum...
The OpenMP worker entry in the VTune report may be a red herring. By this I mean the data in the report may not be meaningful.
What may provide additional information/insight is to use the Threads pane of VTune to look at the individual thread CPU Times to see if the runtimes are balanced. You should be aware that for__write_output (at least to the same I/O unit) will (should) have a critical section, thus serializing the code through that point. There may or may not be other issues if all threads are issuing the WRITE (and no appropriate programming considerations made).
*** I notice that you are using MKL read the following carefully
Programmers familiar with C/C++ programming have learned that a multi-threaded program is supposed to (required to) link with a multi-threaded library.
With respect to MKL, the application of the C/C++ term "multi-threaded" is a misnomer. What really is meant is the library must be thread-safe (multi-thread-safe).
Multi-threaded: The application has but a single thread and the MKL library itself will use multiple threads (OpenMP)
Single-threaded: The application may be single threaded or multi-threaded and for each caller MKL will use the callers thread.
*** Should you link the MKL multi-threaded library together with your 16-thread multi-threaded application, each of those 16 threads will request of MKL to instantiate a thread pool of 16 threads (16 * 16 = 256 threads).