Hi, I am profiling a highly optimized code using VTune XE. To do this I run the Hotspots and Concurrency analysis types. When I open the bottom-up view I can see that a substantial portion of time (about 20% of CPU time) is spent in libiomp5.so, called from clone -> start_thread within the libiomp5 module. I also see libiomp5 among the inner functions called from the module that corresponds to my own code.
My understanding is that the libiomp5.so entries under the inner functions represent time spent in the cloned threads executing those functions. Is this correct? More importantly, does the CPU time reported in the libiomp5 module itself represent threading overhead (thread synchronization, thread pool instantiation)?
The time you saw spent in [libiomp5.so] is activity in an OpenMP* code region. I assume you are using the Intel C++ Compiler (or Composer XE 2011). libiomp5.so <- start_thread <- clone in the inner functions means the program created new thread(s), and that thread creation is implemented by libiomp5.so (the Intel OpenMP runtime library shipped in the C++ Compiler package).
The CPU time reported in the libiomp5 module is not threading overhead; it is the CPU time of the OpenMP* workload itself.
Thank you, Peter, for the quick reply. Indeed, I am using the Intel C++ Compiler.
Just to make sure: the CPU time reported in libiomp5 (within the libiomp5 module, not my modules) is OpenMP workload, while the CPU time reported under libiomp5 in my modules is the CPU time used to run my own code (the time reported there seems consistent with the number of worker threads). Are these statements correct?
Also, is this OpenMP workload typical for threaded code? Before using the new VTune XE, I could not reliably measure the OpenMP CPU time, and it seems to be taking far more resources than I had anticipated.
If CPU time is reported in your code called by libiomp5.so, in the user's modules, that is real multithreaded work running in the other threads. You can go to the source view and observe CPU time in your code and in the OpenMP directives.
If CPU time is reported inside libiomp5.so itself, it comes from the OpenMP* synchronization APIs. For example, the worker threads call these APIs to enter a critical section, and the main thread calls them to perform fork-join. (Use the Locks and Waits analysis to measure wait time.)