I am familiarizing myself with VTune and trying to reduce the OpenMP overhead in my application, if possible.
In part of the documentation here, I read the resolution below for reducing spin time and overhead time.
I just don't understand what "increasing task granularity or the scope of data synchronization" means here.
I would appreciate any further clarifications for that matter.
What that suggests is that you appear to have partitioned the work into code sections that are too small for task dispatch. IOW, consider increasing the amount of work each task performs.
Excessive spin time can be caused by unbalanced work, too little work per parallel region or task, or high contention for a critical or atomic section.
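To make "increase the amount of work each task performs" concrete, here is a minimal sketch in C. The function names (scale_fine, scale_coarse) and the chunk parameter are made up for illustration and are not from VTune or the original poster's code. The first version creates one task per element; the second groups elements into chunks so that each task has enough work to amortize its dispatch cost.

```c
#include <assert.h>
#include <stddef.h>

/* Fine-grained (hypothetical example): one task per element.
   Creating and dispatching a task can cost more than the single
   multiplication it performs. */
void scale_fine(double *a, size_t n, double f)
{
    #pragma omp parallel
    #pragma omp single
    for (size_t i = 0; i < n; ++i) {
        #pragma omp task firstprivate(i)
        a[i] *= f;
    }
}

/* Coarser-grained: one task per block of `chunk` elements, so the
   per-task overhead is spread over more useful work. */
void scale_coarse(double *a, size_t n, double f, size_t chunk)
{
    assert(chunk > 0);
    #pragma omp parallel
    #pragma omp single
    for (size_t start = 0; start < n; start += chunk) {
        size_t end = (start + chunk < n) ? start + chunk : n;
        #pragma omp task firstprivate(start, end)
        for (size_t i = start; i < end; ++i)
            a[i] *= f;
    }
}
```

Both functions produce the same result; only the number of tasks differs. A suitable chunk size depends on the workload and would have to be found by measurement.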
I would also interpret it as "consider increasing task granularity in the case of scheduler overhead, or decreasing the scope of data synchronization to avoid thread contention that leads to spinning (active waits)". Of course, that sounds a bit general. I can recommend looking at the VTune Cookbook article to see some of these hints in action.
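As a rough sketch of the synchronization half of that reading, compare the two hypothetical functions below (sum_critical and sum_reduction are illustrative names only). In the first, every iteration contends for the same critical section. In the second, each thread accumulates into a private copy and the copies are combined once at the end.

```c
/* Every iteration enters the critical section, so threads spend
   much of their time waiting on each other. */
long sum_critical(const int *v, int n)
{
    long total = 0;
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        #pragma omp critical
        total += v[i];
    }
    return total;
}

/* Each thread sums into its own private `total`; the runtime
   combines the private copies once, when the loop finishes. */
long sum_reduction(const int *v, int n)
{
    long total = 0;
    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < n; ++i)
        total += v[i];
    return total;
}
```

Both return the same sum. The second simply synchronizes far less often, which is the kind of change that tends to reduce spin time in a profile.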
BTW - spinning in OpenMP worker threads also can be a result of waiting on serial code execution in the master thread. There is also a cookbook article on this.
>> spinning in OpenMP worker threads also can be a result of waiting on serial code execution in the master thread.
This spinning after the completion of a parallel region/loop is an optimization strategy to avoid additional overhead of making an O/S call to suspend or terminate non-master threads only to reestablish non-master threads immediately thereafter. The duration of the inter-parallel region spin-wait time can be controlled by the environment variable KMP_BLOCKTIME.
A task, IMHO, is whatever runs between the scheduler starting the task and the task completing.
A task may contain synchronization primitives such as a mutex, where the mutex spin-waits in lieu of yielding (e.g. at an OpenMP critical section or OpenMP atomic section). It may also reach the end of an OpenMP loop or parallel region, which contains an implicit synchronization point: each thread that reaches that point enters a spin-wait until all threads of the current thread team have reached it. Note, this spin-wait time (by each thread) runs in the context of the O/S scheduled thread task.
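For illustration, here is a small sketch of where those implicit synchronization points sit. The function name fill_two is hypothetical. The nowait clause removes the barrier at the end of the first loop, which is safe here only because the second loop does not read a[].

```c
/* Two independent worksharing loops inside one parallel region. */
void fill_two(double *a, double *b, int n)
{
    #pragma omp parallel
    {
        /* Without nowait, threads would spin-wait here until the
           whole team finished the first loop. */
        #pragma omp for nowait
        for (int i = 0; i < n; ++i)
            a[i] = 2.0 * i;

        /* This loop keeps its implicit barrier at the end. */
        #pragma omp for
        for (int i = 0; i < n; ++i)
            b[i] = i + 1.0;
    }   /* The implicit barrier at the end of the parallel region remains. */
}
```

If the second loop depended on results of the first, the barrier would be required and nowait would introduce a data race.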
>>> Note, this spin-wait time (by each thread) runs in the context of the O/S scheduled thread task.
Are you suggesting the spin-wait time is spent outside the thread task but inside OpenMP runtime?
Also, if you don't mind, I have a hard time following the sentence below.
IOW, I can't follow what overhead functions are.
Can we say overhead time is the amount of time taken by the OpenMP runtime to initialize a parallel region?
Thanks so much for links to the balance tuning articles.
We are extremely sorry for the delay.
Please find the answers for your queries below:
>> Are you suggesting the spin-wait time is spent outside the thread task but inside OpenMP runtime?
Answer: The spin-wait time is measured as overhead and is listed as part of the OpenMP overhead. It is not time spent doing meaningful work but simply overhead from the OpenMP runtime. Large spin time can also result from load imbalance between threads.
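As a hedged illustration of the load-imbalance case (the function name triangular_work and the chunk size 16 are arbitrary choices for this sketch), consider a loop where iteration i does i+1 units of work. With an even block split, the threads that receive the low-numbered iterations finish early and then spin at the barrier. A dynamic schedule hands out chunks on demand instead.

```c
/* Counts the inner steps of a triangular loop nest: n*(n+1)/2. */
long triangular_work(int n)
{
    long total = 0;
    /* schedule(dynamic, 16): idle threads grab the next 16 iterations,
       trading a little scheduling overhead for better balance. */
    #pragma omp parallel for schedule(dynamic, 16) reduction(+:total)
    for (int i = 0; i < n; ++i)
        for (int j = 0; j <= i; ++j)
            total += 1;
    return total;
}
```

Dynamic scheduling has its own cost, so whether it helps, and with which chunk size, should be checked by profiling again.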
>> VTune Profiler classifies the stack layers into user, system, and overhead layers and attributes the CPU time spent in system functions called by overhead functions to the overhead functions.
I can't follow what overhead functions are?
Answer: If the OpenMP runtime makes a system call, the time spent in the system call is also accounted as OpenMP overhead.
>> Can we say overhead time is the amount of time taken by OpenMP runtime to initialize a parallel region?
Answer: The overhead time includes everything from creating the OpenMP threads to the time taken to split the workload between threads, the synchronization between threads, and thread scheduling. Basically, it includes initializing a parallel region and the other things mentioned above.
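One common way to cut the region-initialization part of that overhead is to open the parallel region once, outside a repeated loop, rather than on every pass. The sketch below uses hypothetical function names (relax_per_step, relax_hoisted) and assumes every thread runs the same number of steps.

```c
/* A new parallel region is entered on every step, so the fork/join
   cost is paid `steps` times. */
void relax_per_step(double *a, int n, int steps)
{
    for (int s = 0; s < steps; ++s) {
        #pragma omp parallel for
        for (int i = 0; i < n; ++i)
            a[i] += 1.0;
    }
}

/* The region is entered once; each thread runs the outer loop and
   the inner `omp for` divides the iterations among the team. The
   implicit barrier after each `omp for` keeps the steps in order. */
void relax_hoisted(double *a, int n, int steps)
{
    #pragma omp parallel
    for (int s = 0; s < steps; ++s) {
        #pragma omp for
        for (int i = 0; i < n; ++i)
            a[i] += 1.0;
    }
}
```

Both versions compute the same result. How much the hoisted form saves depends on the runtime and on how much work each step does, so it is worth confirming in VTune.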
Hope this helps. If you have any further issue, please let us know.