Dear all,

Ramin_M_ · ‎08-07-2013

Hello,

Using Vtune Amplifier concurrency analysis on an example code of dgemm (link here), the overhead and spin time surprisingly covered almost 100% of the CPU usage bar! (reported here). I tried VTune concurrency profiling tool for sparse matrix by vector multiplication kernel mkl_dcsrsymv as well, and similar result was obtained. Since in the examples mentioned here, a very high performance is achieved, the large overhead reported seems irrelevant. I initially asked for an explanation in VTune Amplifier forum (here) and I was advised to ask the question in this forum.

Do you have any explanation for the large overhead and spin time?

Cheers,

note: Vtune Amplifier update 11, Intel Composer XE 2013 are used.

SergeyKostrov · ‎08-08-2013

>>...Do you have any explanation for the large overhead and spin time? It would be nice to see a VTune screenshot. Also, details about your hardware, platform, MKL version, etc are needed. It is really hard to explain what is going on with your tests.

SergeyKostrov · ‎08-08-2013

Here is a quote from your another post: >>...CPU usage histogram shows almost zero usage!.. Almost zero CPU usage could be in cases when I/O requests, for example data are read from VM paging file, are Not completed.

SergeyKostrov · ‎08-08-2013

>>...It would be nice to see a VTune screenshot... I saw the screenshot. My questions are still the same - more technical details are needed. You even did not tell us how big are dense matricies and how much physical and virtual memory your system has.

Ramin_M_ · ‎08-09-2013

Hi Sergey, thanks for the reply. The RAM and virtual memory are 64 GB each with a Xeon E5-1650 processor, Hyperthreading disabled. OS Windows 7. MKL Composer XE 2013. For the dense matrix multiplication, the screenshot is for matrix A (2000x200) and B(200x1000) in a for loop repeated for 200 times, getting an average time of 10 msec per multiplication. Once again I'd like to emphasize the GFLOPS I achieve is pretty high, but the profiling results are questionable. For case of sparse matrix by dense vector multiplication, the matrix size is about 300kx300k. Almost 100% overhead and close to zero usage for CPU is consistent in all the test cases, with different sizes.

Ramin_M_ · ‎08-09-2013

Almost zero CPU usage could be in cases when I/O requests

Good point, but I ran the tests, pausing profiling for I/O operations and resuming for computational kernels and yet I got close to zero CPU usage, which does not make sense!

SergeyKostrov · ‎08-09-2013

>>...For case of sparse matrix by dense vector multiplication, the matrix size is about 300kx300k. Almost 100% overhead and close to >>zero usage for CPU is consistent in all the test cases... How much memory is allocated in that case? Take a look at Windows Task Manager ( Processes property page / 'Mem Usage', 'VM Size', etc columns ). Thanks.

SergeyKostrov · ‎08-09-2013

Could you post a screenshot of Windows Task Manager ( Performance property page / From the Beginning to the End of computations ) for the case with sparse matricies?

Ramin_M_ · ‎08-09-2013

Below is a screenshot from task manager, which shows two peaks on CPU usage referring to twice executation of mkl_dcsrgemv. The memory required to store the matrix is less than 200 MB. My understanding is VTune profiler subtracts the overhead time from CPU usage time and reports it as net CPU usage in the summary page. Since the overhead is reported close to 100 %, the CPU histogram shows almost zero usage. But the question remains there, why close to 100 % overhead? a bug from VTune?

thanks.

Ramin_M_ · ‎08-09-2013

duplicated message. please ignore this, the info is provided below.

Ying_H_Intel · ‎08-13-2013

Hi Ramin,

We can see the same behavior as yours. I will escalate the result to our developers and see if there is some improvement on this.

Regarding the result, I think, it is expected with current dgemm parallel algorithm. As perter explained, the overhead time (or threading tansition time) is high because if one big task are splitted into many small tasks in many threads, adn each task only has small workload then swith to another threads. So accumulate overhead time will be big. It is true actally for MKL dgemm function, internally, the data are divided into very small blocks. and many threads (6x200 loop)workers) keep geting the small blocks data to computing. and yes, in general case, we should adjust algorithm (merge small tasks) to reduce thread transitions. But here we actauly need to consider the balance of task granularity and compute efficiency. We hope one thread can perform more computation, at the same time, we need keep enough data is ready for the compution. so the task granularity is key, if it is too small, the overhead is higher, if it is too big, then the data is not ready, the computing can't going. So it is a balance result. even with the high overhead and spin time, a very high performance can be achieved.

Best Regards,

Ying

Ramin_M_ · ‎08-14-2013

Thank you Ying for the clarification. I'd be very glad to know if there is some potential for achieving higher performance.

Best regards, -Ramin

Hans_P_Intel · ‎08-19-2013

The wait time (or a high amount of spin time) does not necessary mean an inefficient synchronization primitive. It can be related to the synchronization time needed to balance work. In case of sparse matrices, the imbalance is caused by the input data and structure of the matrix. I wonder if the picture improves with OMP_SCHEDULE=DYNAMIC and OMP_DYNAMIC=1?

Baoyin_D_Intel · ‎06-23-2014

Hi Ramin,

I have almost the same problem with you, which have very high overhead and spin time. And in Vtune, it shows that the spin time is the main problem. Can you give me some information about how did you solve this problem finally? Thanks a lot.

Best, --Baoyin

TimP · ‎06-23-2014

Since the earlier posts were made, new releases of MKL may have improved their provision for balancing work in the sparse methods. I haven't seen MKL use schedule dynamic, rather, an internal algorithm is used for static workload balancing (pre-processing to estimate work in each loop iteration and distributing work accordingly). Such a method may be superior to dynamic, particularly on a platform with NUMA characteristics.

If MKL doesn't need all the threads, there may be apparent overhead associated with spinning of idle threads. You need to "drill down" to individual threads to see if this is the case: if some threads are running efficiently, it may be inconsequential that others are spin waiting at the same time.

You may need a fairly large problem to overcome the overhead of starting dynamic libraries and arranging thread pools. It's larger on Windows than linux, much larger yet on Intel(r) Xeon Phi(tm), where nearly half a second may be spent, at least the first time within a run.

Bernard · ‎06-23-2014

Sergey Kostrov wrote:

Here is a quote from your another post:

>>...CPU usage histogram shows almost zero usage!..

Almost zero CPU usage could be in cases when I/O requests, for example data are read from VM paging file, are Not completed.

Task Manager does not show CPU load in total clock cycles per thread so there is no really zero usage.On Windows KeIdleLoop function will run when there is low activity on some cores.

I/O requests from paging file will have increased activity on CPU0 which is servicing ISR and DPC's.

Ying_H_Intel · ‎07-06-2014

Dear all,

Thank you for sharing your ideas about the problem. We escalated the issue to Vtune team and mkl team. Please see below analysis which should disclose the truth: MKL function was treated as system function in Vtune. The overhead and spin times are actually from MKL computing kernel function. it is computing time, but was marked as overhead time. We may fix this. Please see detials below.

The CPU utilization is zero, because all CPU time is overhead. Overhead and spin time is subtracted from CPU time, because VTune considers busy waiting as waiting. This problem will disappear if MKL computing functions are not marked as overhead.

In usual OpenMP region, if there is an OMP function like “_kmp_invoke_task_func” in the stack, and there is user code below in stack, no overhead is reported. But if there are only system functions in the stack, they all are considered as overhead. And MKL is treated as system function.

To fix this, VTune should apply smarter rules, so library computations are not marked as overhead, but actual overhead is still reported. This will require some investigation from vtune development, they are thinking about it.

For now, there is a workaround. In VTune config file “<vtune_install_dir>/config/resolver/openmp.cfg”, delete 3 marked lines:

<bag

xmlns:boolean="http://www.w3.org/2001/XMLSchema#boolean"

xmlns:int="http://www.w3.org/2001/XMLSchema#int"

xmlns:double="http://www.w3.org/2001/XMLSchema#double"

>

<regex>^_?_kmp_launch_worker</regex>

<regex>^_?_kmp_invoke_task_func</regex>

<regex>^vcomp_fork</regex>

Then re-resolve the result. Don’t save original file copy in the same directory, even with different name – in this case there will not be an effect.

After that CPU time and CPU utilization are reported as expected.

Best Regards,

Ying

Bernard · ‎07-07-2014

Thanks for providing the explanation.

abhimodak · ‎07-08-2014

Hi

Can you please tell exactly which 3 lines to delete? Perhaps, it will be more useful if the modified configuration file is attached.

Abhi

robert-reed · ‎07-08-2014

Abhi filed a Premier request, and I suspect I know which three lines are to be removed. Unfortunately, they were REMOVED from the listing above rather than MARKED. I will post what I discovered, but I will rely on H. Ying to confirm that these are the lines in question. The unmodified file above begins like this:

<bag
xmlns:boolean="http://www.w3.org/2001/XMLSchema#boolean"
xmlns:int="http://www.w3.org/2001/XMLSchema#int"
xmlns:double="http://www.w3.org/2001/XMLSchema#double"
>
<systemTypePatterns objectType="function" boolean:ignoreCase="false">
    <systemTypePattern systemType="overhead">
      <regex>^_?_kmpc_fork_call</regex>
      <regex>^_?_kmp_launch_worker</regex>
      <regex>^_?_kmp_invoke_task_func</regex>
      <regex>^vcomp_fork</regex>
>>>>      <regex>^vcomp::PersistentThreadFunc</regex>
>>>>      <regex>^gomp_thread_start</regex>
>>>>    <regex>^gomp_team_start</regex>
    </systemTypePattern>

You will note three additional lines in this listing, missing in the listing up above. These apparently are regular expressions for matching certain function names and certainly seem more plausible as the lines in question. I suspect these are the lines that are recommended for removal from this configuration file as a workaround.

Well, harrumph! I don't know why the editor associated with this interface provides the means to apply color backgrounds to the text when the text itself does not end up highlighted. I can see the lines highlighted right now! Can you? I didn't think so. Therefore, I've edited the insert and marked the three lines in question by prefacing them with ">>>>". Now can you see them?

Ying_H_Intel · ‎07-08-2014

Hi Abhi, Robert,

thanks much for asking. The mark was removed. The three lines are

<regex>^_?_kmp_launch_worker</regex>

<regex>^_?_kmp_invoke_task_func</regex>

Thanks much!

Ying

Large overhead and spin time reported in MKL functions