I am using VTune Amplifier to find hotspots in my F90 code, in a Visual Studio 2005 and Windows 7 environment. The top 4 hot spots are
These functions/subroutines are definitely not in my code. My questions are
1) Why are they listed there when I can do nothing to improve my code (or am I wrong, see Q3 below)?
2) How can I find out where (in my code) they are called?
3) Given that these functions/subroutines take most of the time, can I really do anything to speed up the run?
Thanks a lot!
The listed functions show that you use Intel OpenMP. We have an OpenMP analysis feature that lets you look at how efficiently OpenMP is used and estimate the potential wall-time gain from removing the inefficiencies.
If you run the hotspots or advanced-hotspots analysis, you will see an "OpenMP Analysis" section on the summary pane with aggregated metrics. The next section shows the top 5 OpenMP regions by potential gain (in wall time). If a potential gain looks worth pursuing, follow the link on the region name to the bottom-up view, where the grid is grouped by OpenMP regions/barriers. There you can expand the "Potential Gain" column and see the wall-time impact of imbalance, scheduling, locks, atomics, etc. on each region/barrier. Cells with significant time are highlighted, with hints on how to avoid the inefficiency. You can also refer to https://software.intel.com/en-us/node/596573 for more details.
Thanks & Regards, Dmitry
Hi Tim and dmitry,
Thank you for the replies and suggestions. Both of you are saying that these functions belong to OpenMP. This is even more weird, because all my OpenMP code lines have the conditional compilation prefix, !$, and I did not enable OpenMP with "/Qopenmp" when compiling. My full compiling options are:
/nologo /debug:full /Qunroll:100 /Qparallel /Qipo /I"C:\Program Files\Intel\MPI\4.0.1.007\ia32\include" /module:"Release\\" /object:"Release\\" /libs:static /threads /c
So no OpenMP should be invoked.
That might be due to the option /Qparallel, which makes the compiler auto-parallelize loops so that they run on multiple threads. See the excerpt below from the Intel C++ compiler's documentation. (You can also review the VTune(TM) Amplifier report with the "group-by thread" option: how many threads do you see?)
When the compiler is unable to automatically parallelize loops you know to be parallel, use OpenMP*. OpenMP* is the preferred solution because you understand the code better than the compiler and can express parallelism at a coarser granularity. Alternatively, automatic parallelization can be effective for nested loops, such as those in a matrix multiply. Moderately coarse-grained parallelism results from threading of the outer loop, allowing the inner loops to be optimized for fine-grained parallelism using vectorization or software pipelining.
You are right! I thought functions like "__kmp_end_split_barrier" were invoked only when OpenMP is used, but they are also invoked when "/Qparallel" is used (confusing to me indeed).
Thank you so much!
/Qparallel is a compiler option for automatic parallelization, and it uses the OpenMP runtime library, which is why functions such as "__kmp_end_split_barrier" show up without /Qopenmp. If you use /Qpar-threshold to increase parallelism, you may see excessive parallelization with a net performance loss.
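To illustrate the situation discussed in this thread, here is a minimal sketch (a made-up program, not the original poster's code; the names autopar_demo, a, b and n are hypothetical). It assumes a build with /Qparallel but without /Qopenmp: the "!$" line is then treated as an ordinary comment, yet the loop is still a candidate for auto-parallelization, so time spent in the OpenMP runtime can appear in the profile.

```fortran
program autopar_demo
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 2000000      ! hypothetical problem size
  real(dp), allocatable :: a(:), b(:)
  integer :: i

  allocate(a(n), b(n))
  b = 1.0_dp

  ! Conditional-compilation sentinel: this line is compiled only when
  ! OpenMP is enabled (/Qopenmp); otherwise it is a plain comment.
  !$ print *, 'OpenMP conditional compilation is active'

  ! Iterations are independent, so an auto-parallelizer (/Qparallel)
  ! may split this loop across threads even without /Qopenmp.
  do i = 1, n
     a(i) = sqrt(b(i)) + 2.0_dp * b(i)
  end do

  print *, 'checksum = ', sum(a)
end program autopar_demo
```

To check which loops were actually threaded, compiler versions of this era offered an auto-parallelizer report option (/Qpar-report); the exact option name and report levels depend on the compiler version, so treat that as something to verify in your compiler's documentation. Building once with /Qparallel and once without it, then comparing the two VTune profiles, should show whether the threading overhead pays off for a given loop.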