While profiling my code, I observed surprisingly large overhead and spin time. I then ran the VTune Amplifier concurrency analysis on the dgemm example from the MKL tutorial (link here) and found that overhead and spin time covered almost 100% of the CPU usage bar (see the figure below).
As I understand overhead and spin time (consistent with the definitions in the Intel® VTune Amplifier help), both metrics should be small, close to zero, in an efficient parallel code. So I am surprised that profiling the MKL matrix-matrix multiplication shows almost 100% overhead and spin time. The summary page shows: CPU time: 12.421, Overhead time: 10.125, Spin time: 2.170, concurrency: ideal, and the CPU usage histogram shows almost zero usage. Can you please clarify this?
Note: I am using update 11 of VTune Amplifier.
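As a quick sanity check on the "almost 100%" figure, the summary numbers can be plugged into a short calculation. This is only a sketch: the function name is made up for illustration, and it assumes that all three values are in the same unit and that overhead and spin time are both counted inside CPU time.

```python
def overhead_and_spin_share(cpu_time, overhead_time, spin_time):
    """Fraction of CPU time spent in overhead plus spin (hypothetical helper)."""
    if cpu_time <= 0:
        raise ValueError("cpu_time must be positive")
    return (overhead_time + spin_time) / cpu_time


# Values quoted from the summary page in this post.
cpu_time = 12.421
overhead_time = 10.125
spin_time = 2.170

share = overhead_and_spin_share(cpu_time, overhead_time, spin_time)
# Assumption: whatever is left after overhead and spin is useful work.
effective = cpu_time - overhead_time - spin_time

print(f"overhead + spin share of CPU time: {share:.1%}")
print(f"remaining CPU time: {effective:.3f}")
```

Under those assumptions, roughly 99% of the reported CPU time is overhead plus spin, which matches what the CPU usage bar shows.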
Please let me know whether this is a bug or whether I am doing something wrong. Thanks.
Overhead time is the time spent when one thread releases a resource and another thread, waiting on that resource, takes over to run its next instructions. This is the thread transition time, and it is usually minimal. However, if you split a big task into many small tasks across many threads, and each task does only a small amount of work before switching to another thread, the accumulated overhead time becomes large. In that situation you have to adjust the algorithm (merge small tasks) to reduce thread transitions. For example, reduce the number of threads in OpenMP: try setting the thread count to your core count with omp_set_num_threads() or mkl_set_num_threads().
Spin time is the time one thread spends busy-waiting for a resource to be freed by another thread; it consumes a lot of CPU time. Review whether all spin locks are necessary. Is it possible to use queue_lock instead of spin_lock?
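To illustrate what "merge small tasks" can look like, here is a minimal Python sketch. It is not MKL or OpenMP code: the function names and the squaring workload are invented for illustration, and Python threads are used only to count how many task hand-offs each approach creates, not to measure speed.

```python
from concurrent.futures import ThreadPoolExecutor


def square(x):
    """Stand-in for a very small unit of work."""
    return x * x


def process_chunk(chunk):
    """Do many small units of work inside a single task."""
    return [square(x) for x in chunk]


def run_fine_grained(items, workers):
    """One task per item: every item costs a hand-off to a worker thread."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(square, x) for x in items]
        return [f.result() for f in futures], len(futures)


def run_merged(items, workers):
    """One task per worker: each thread receives one contiguous chunk."""
    size = max(1, -(-len(items) // workers))  # ceiling division, at least 1
    chunks = [items[i:i + size] for i in range(0, len(items), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(process_chunk, c) for c in chunks]
        results = [y for f in futures for y in f.result()]
    return results, len(futures)


items = list(range(10_000))
fine, fine_tasks = run_fine_grained(items, workers=4)
merged, merged_tasks = run_merged(items, workers=4)

print("same results:", fine == merged)
print("tasks submitted, fine-grained:", fine_tasks)
print("tasks submitted, merged:", merged_tasks)
```

Both versions compute the same results, but the merged version submits one task per worker instead of one per item, so there are far fewer thread transitions to pay for.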
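The difference between spinning and a blocking wait can be sketched as follows. This is an illustration only: it is not a queuing lock and not what MKL or OpenMP does internally. It uses Python's `threading.Event` as a stand-in for "the resource becomes free", and the helper names are hypothetical.

```python
import threading
import time


def wait_by_spinning(flag):
    """Busy-wait: keep checking the flag, using CPU the whole time."""
    polls = 0
    while True:
        polls += 1
        if flag.is_set():
            return polls


def wait_by_blocking(flag):
    """Blocking wait: sleep until another thread sets the flag."""
    flag.wait()


def release_after(flag, delay):
    """Set the flag from another thread after `delay` seconds."""
    timer = threading.Timer(delay, flag.set)
    timer.start()
    return timer


# Spinning waiter: count the polls and the CPU time consumed.
spin_flag = threading.Event()
timer = release_after(spin_flag, 0.05)
start = time.process_time()
spin_polls = wait_by_spinning(spin_flag)
spin_cpu = time.process_time() - start
timer.join()

# Blocking waiter: same delay, but the thread sleeps instead of polling.
block_flag = threading.Event()
timer = release_after(block_flag, 0.05)
start = time.process_time()
wait_by_blocking(block_flag)
block_cpu = time.process_time() - start
timer.join()

print(f"spinning: {spin_polls} polls, {spin_cpu:.4f} s CPU")
print(f"blocking: {block_cpu:.4f} s CPU")
```

The spinning waiter typically polls the flag many times and accumulates CPU time for the whole wait, while the blocking waiter uses almost none. The exact numbers depend on the machine.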
Hope it helps.
Regards, Peter
Thank you, Peter, for the clear explanation. I still don't see why the MKL matrix-matrix multiplication kernel looks so poor in the profile. Almost zero CPU usage and close to 100% overhead for dgemm (I posted an image of it above) is what makes me think something is wrong.
Regards, Ramin
It is good that you understand these metrics from VTune Amplifier XE. I am not an expert in MKL, so please ask on the MKL forum why there are so many thread transitions in such a short time period, causing high overhead. Thanks.
-Peter
