Using Vtune Amplifier concurrency analysis on an example code of dgemm (link here), the overhead and spin time surprisingly covered almost 100% of the CPU usage bar! (reported here). I tried VTune concurrency profiling tool for sparse matrix by vector multiplication kernel mkl_dcsrsymv as well, and similar result was obtained. Since in the examples mentioned here, a very high performance is achieved, the large overhead reported seems irrelevant. I initially asked for an explanation in VTune Amplifier forum (here) and I was advised to ask the question in this forum.
Do you have any explanation for the large overhead and spin time?
note: Vtune Amplifier update 11, Intel Composer XE 2013 are used.
Hi Sergey, thanks for the reply. The RAM and virtual memory are 64 GB each with a Xeon E5-1650 processor, Hyperthreading disabled. OS Windows 7. MKL Composer XE 2013. For the dense matrix multiplication, the screenshot is for matrix A (2000x200) and B(200x1000) in a for loop repeated for 200 times, getting an average time of 10 msec per multiplication. Once again I'd like to emphasize the GFLOPS I achieve is pretty high, but the profiling results are questionable. For case of sparse matrix by dense vector multiplication, the matrix size is about 300kx300k. Almost 100% overhead and close to zero usage for CPU is consistent in all the test cases, with different sizes.
Almost zero CPU usage could be in cases when I/O requests
Good point, but I ran the tests, pausing profiling for I/O operations and resuming for computational kernels and yet I got close to zero CPU usage, which does not make sense!
Below is a screenshot from task manager, which shows two peaks on CPU usage referring to twice executation of mkl_dcsrgemv. The memory required to store the matrix is less than 200 MB. My understanding is VTune profiler subtracts the overhead time from CPU usage time and reports it as net CPU usage in the summary page. Since the overhead is reported close to 100 %, the CPU histogram shows almost zero usage. But the question remains there, why close to 100 % overhead? a bug from VTune?
We can see the same behavior as yours. I will escalate the result to our developers and see if there is some improvement on this.
Regarding the result, I think, it is expected with current dgemm parallel algorithm. As perter explained, the overhead time (or threading tansition time) is high because if one big task are splitted into many small tasks in many threads, adn each task only has small workload then swith to another threads. So accumulate overhead time will be big. It is true actally for MKL dgemm function, internally, the data are divided into very small blocks. and many threads (6x200 loop)workers) keep geting the small blocks data to computing. and yes, in general case, we should adjust algorithm (merge small tasks) to reduce thread transitions. But here we actauly need to consider the balance of task granularity and compute efficiency. We hope one thread can perform more computation, at the same time, we need keep enough data is ready for the compution. so the task granularity is key, if it is too small, the overhead is higher, if it is too big, then the data is not ready, the computing can't going. So it is a balance result. even with the high overhead and spin time, a very high performance can be achieved.
The wait time (or a high amount of spin time) does not necessary mean an inefficient synchronization primitive. It can be related to the synchronization time needed to balance work. In case of sparse matrices, the imbalance is caused by the input data and structure of the matrix. I wonder if the picture improves with OMP_SCHEDULE=DYNAMIC and OMP_DYNAMIC=1?
I have almost the same problem with you, which have very high overhead and spin time. And in Vtune, it shows that the spin time is the main problem. Can you give me some information about how did you solve this problem finally? Thanks a lot.
Since the earlier posts were made, new releases of MKL may have improved their provision for balancing work in the sparse methods. I haven't seen MKL use schedule dynamic, rather, an internal algorithm is used for static workload balancing (pre-processing to estimate work in each loop iteration and distributing work accordingly). Such a method may be superior to dynamic, particularly on a platform with NUMA characteristics.
If MKL doesn't need all the threads, there may be apparent overhead associated with spinning of idle threads. You need to "drill down" to individual threads to see if this is the case: if some threads are running efficiently, it may be inconsequential that others are spin waiting at the same time.
You may need a fairly large problem to overcome the overhead of starting dynamic libraries and arranging thread pools. It's larger on Windows than linux, much larger yet on Intel(r) Xeon Phi(tm), where nearly half a second may be spent, at least the first time within a run.
Sergey Kostrov wrote:
Here is a quote from your another post:
>>...CPU usage histogram shows almost zero usage!..
Almost zero CPU usage could be in cases when I/O requests, for example data are read from VM paging file, are Not completed.
Task Manager does not show CPU load in total clock cycles per thread so there is no really zero usage.On Windows KeIdleLoop function will run when there is low activity on some cores.
I/O requests from paging file will have increased activity on CPU0 which is servicing ISR and DPC's.
Thank you for sharing your ideas about the problem. We escalated the issue to Vtune team and mkl team. Please see below analysis which should disclose the truth: MKL function was treated as system function in Vtune. The overhead and spin times are actually from MKL computing kernel function. it is computing time, but was marked as overhead time. We may fix this. Please see detials below.
The CPU utilization is zero, because all CPU time is overhead. Overhead and spin time is subtracted from CPU time, because VTune considers busy waiting as waiting. This problem will disappear if MKL computing functions are not marked as overhead.
In usual OpenMP region, if there is an OMP function like “_kmp_invoke_task_func” in the stack, and there is user code below in stack, no overhead is reported. But if there are only system functions in the stack, they all are considered as overhead. And MKL is treated as system function.
To fix this, VTune should apply smarter rules, so library computations are not marked as overhead, but actual overhead is still reported. This will require some investigation from vtune development, they are thinking about it.
For now, there is a workaround. In VTune config file “<vtune_install_dir>/config/resolver/openmp.cfg”, delete 3 marked lines:
<systemTypePatterns objectType="function" boolean:ignoreCase="false">
Then re-resolve the result. Don’t save original file copy in the same directory, even with different name – in this case there will not be an effect.
After that CPU time and CPU utilization are reported as expected.
Abhi filed a Premier request, and I suspect I know which three lines are to be removed. Unfortunately, they were REMOVED from the listing above rather than MARKED. I will post what I discovered, but I will rely on H. Ying to confirm that these are the lines in question. The unmodified file above begins like this:
<systemTypePatterns objectType="function" boolean:ignoreCase="false">
You will note three additional lines in this listing, missing in the listing up above. These apparently are regular expressions for matching certain function names and certainly seem more plausible as the lines in question. I suspect these are the lines that are recommended for removal from this configuration file as a workaround.
Well, harrumph! I don't know why the editor associated with this interface provides the means to apply color backgrounds to the text when the text itself does not end up highlighted. I can see the lines highlighted right now! Can you? I didn't think so. Therefore, I've edited the insert and marked the three lines in question by prefacing them with ">>>>". Now can you see them?
Hi Abhi, Robert,
thanks much for asking. The mark was removed. The three lines are