I have recently started using PCM and absolutely love the kind of information it provides. I am using the C++ APIs to instrument my code.
I am particularly interested in measuring memory-related metrics such as the total memory bandwidth achieved and cache misses.
I use the getSystemCounterState API to get before and after states, and then use the API to get the total bytes read from and written to the memory controllers (MCs).
I am, however, seeing quite a large overhead for each getSystemCounterState call:
Around 1 ms on a 10-core Broadwell desktop
Around 6 ms on a 20-core Skylake Xeon
Around 64 ms on a 68-core KNL!
Is this expected? The overhead is significantly distorting the numbers I am seeing. Is there any way to avoid it?
getSystemCounterState returns the counters for the entire system, that is, all sockets and all logical cores within those sockets. Therefore, the more cores you have, the longer this function will take.
The question is whether you need the counters for the entire system, or whether you can rework your code to query only specific cores, which would be a lot quicker.
Thanks for your response.
I need to query metrics such as BytesReadFromMCs, LLCMisses, etc. These are not Core metrics.
Can I still get this information by querying a specific core?
Unfortunately, getSystemCounterState is not parallelized yet, so the time it takes to read all core counters grows linearly with the number of cores. As suggested, you can try using getUncoreCounterStates or getAllCounterStates. The latter parallelizes the most time-consuming part, which is reading the core counters. Make sure you use the latest PCM from GitHub (https://github.com/opcm/pcm/).
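
To check which of these calls is actually cheaper on a given machine, it may help to time each one in isolation before instrumenting real code. Below is a minimal sketch of such a timing helper. The name mean_call_cost_us is hypothetical and not part of PCM; the helper only assumes that the snapshot call can be wrapped in a callable.

```cpp
#include <chrono>
#include <cstddef>

// Hypothetical helper (not part of PCM): returns the mean wall-clock cost,
// in microseconds, of one call to `snapshot`, averaged over `reps` calls.
template <typename Snapshot>
double mean_call_cost_us(Snapshot&& snapshot, std::size_t reps)
{
    if (reps == 0) {
        return 0.0;  // nothing to measure
    }
    using clock = std::chrono::steady_clock;  // monotonic, suited to intervals
    const auto start = clock::now();
    for (std::size_t i = 0; i < reps; ++i) {
        snapshot();  // result is discarded; only the call cost matters here
    }
    const auto stop = clock::now();
    const std::chrono::duration<double, std::micro> total = stop - start;
    return total.count() / static_cast<double>(reps);
}
```

Assuming an initialized and programmed PCM instance m, a call could look like mean_call_cost_us([&] { m->getSystemCounterState(); }, 100), with a similar lambda wrapping getUncoreCounterStates or getAllCounterStates. Check cpucounters.h in your PCM version for the exact signatures of those two functions. If the measured cost is small compared with the region being instrumented, the before/after snapshots should no longer distort the results much.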