Overhead for getSystemCounterState

Gaurav_B_Intel
Employee

Hi,

I have recently started using PCM and absolutely love the kind of information it provides. I am using the C++ APIs to instrument my code.

I am particularly interested in memory-related metrics such as total achieved memory bandwidth and cache misses.

I call the getSystemCounterState API to capture before and after states, and then use the API to get the total bytes read from and written to the memory controllers (MCs).

I am, however, seeing quite a large overhead for each getSystemCounterState call.

I see:

Around 1 ms on a 10-core Broadwell desktop

Around 6 ms on a 20-core Skylake Xeon

Around 64 ms on a 68-core KNL!

Is this expected? The overhead is noticeably distorting the numbers I am seeing. Is there any way to avoid it?

Thanks,

Gaurav.
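
One way to quantify the overhead described above is to time the snapshot call in isolation and compare the result with the runtime of the instrumented region. The following is only a minimal sketch: `mean_call_cost_ms` and `fake_snapshot` are hypothetical names, and `fake_snapshot` is a stand-in that just sleeps for 2 ms so the example does not depend on PCM.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Mean wall-clock cost, in milliseconds, of one call to `snapshot`.
double mean_call_cost_ms(const std::function<void()>& snapshot, int repetitions) {
    assert(repetitions > 0);
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    for (int i = 0; i < repetitions; ++i) {
        snapshot();  // the call whose overhead we want to know
    }
    const std::chrono::duration<double, std::milli> elapsed = clock::now() - start;
    return elapsed.count() / repetitions;
}

// Hypothetical stand-in for a counter snapshot (assumption: not a PCM call).
// It only sleeps, so the measured cost should be at least about 2 ms.
void fake_snapshot() {
    std::this_thread::sleep_for(std::chrono::milliseconds(2));
}
```

In an instrumented program the callable would presumably be a lambda wrapping the real snapshot call instead of `fake_snapshot`. If the measured cost is a significant fraction of the region being profiled, the snapshots are likely to disturb the measurement.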

Richard_Nutman
New Contributor I

getSystemCounterState returns the counters for the entire system, which means all sockets and all logical cores within those sockets. Therefore, the more cores you have, the longer this function will take.

The question is whether you need the counters for the entire system, or whether you can rework your code to query only specific cores, which would be a lot quicker.
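
As a rough check of the scaling described above, the timings reported in the question can be divided by the stated core counts. This sketch assumes the counts given in the question are physical cores; the number of logical cores may be higher if hyper-threading is enabled. `per_core_ms` is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstdio>

// Average snapshot cost per core, in milliseconds.
double per_core_ms(double total_ms, int cores) {
    assert(cores > 0);
    return total_ms / cores;
}

// Prints the per-core cost for the three machines mentioned in the question.
void print_per_core_costs() {
    std::printf("Broadwell (10 cores): %.3f ms/core\n", per_core_ms(1.0, 10));
    std::printf("Skylake   (20 cores): %.3f ms/core\n", per_core_ms(6.0, 20));
    std::printf("KNL       (68 cores): %.3f ms/core\n", per_core_ms(64.0, 68));
}
```

Under that assumption the per-core cost comes out at roughly 0.1 ms, 0.3 ms and 0.94 ms, so the total grows with core count, but the stated core count alone does not account for the whole difference between the machines.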

Gaurav_B_Intel
Employee

Thanks for your response.

I need to query metrics such as BytesReadFromMCs, LLCMisses, etc. These are not Core metrics.

Can I still get this information by querying a specific core?

Richard_Nutman
New Contributor I

Try and see if you can use getUncoreCounterStates.

It retrieves uncore information for system and sockets, but no core information.

Roman_D_Intel
Employee

Unfortunately, getSystemCounterState is not parallelized yet, so it takes linear time to read all core counters. As suggested, you can try using getUncoreCounterStates or getAllCounterStates. The latter parallelizes the most time-consuming part, which is reading the core counters. Make sure you use the latest PCM from GitHub (https://github.com/opcm/pcm/).

Thanks,

Roman
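
The difference between a serial and a parallelized readout can be illustrated with a small sketch. This is not PCM's implementation: `read_core_counter` is a hypothetical stand-in that returns a dummy value, and `std::async` is used here only as one simple way to issue the per-core reads concurrently.

```cpp
#include <cassert>
#include <cstdint>
#include <future>
#include <vector>

// Hypothetical stand-in for reading one core's counter (assumption: a real
// read would be far more expensive than this arithmetic).
std::uint64_t read_core_counter(int core_id) {
    return static_cast<std::uint64_t>(core_id) * 100u;
}

// Serial readout: total time grows linearly with the number of cores.
std::vector<std::uint64_t> read_all_serial(int num_cores) {
    assert(num_cores >= 0);
    std::vector<std::uint64_t> values;
    values.reserve(static_cast<std::size_t>(num_cores));
    for (int core = 0; core < num_cores; ++core) {
        values.push_back(read_core_counter(core));
    }
    return values;
}

// Parallel readout: every per-core read is launched as its own task, then
// the results are collected in core order.
std::vector<std::uint64_t> read_all_parallel(int num_cores) {
    assert(num_cores >= 0);
    std::vector<std::future<std::uint64_t>> pending;
    pending.reserve(static_cast<std::size_t>(num_cores));
    for (int core = 0; core < num_cores; ++core) {
        pending.push_back(std::async(std::launch::async, read_core_counter, core));
    }
    std::vector<std::uint64_t> values;
    values.reserve(pending.size());
    for (auto& result : pending) {
        values.push_back(result.get());
    }
    return values;
}
```

Both functions return the same values; only the wall-clock time differs when each individual read is slow. With a trivial stand-in like this one, the task launch overhead would dominate, so the sketch shows the structure rather than an actual speedup.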
