This may be a stupid question... but what would be a good method/counter to estimate the (per core/per socket) utilization of the CPU? I would like to get a percentage like the one many OS tools (e.g., nmon, ...) provide - but since I am collecting counters in my experiment framework anyway, it would be easier to use them instead of OS tools.
Any suggestions are appreciated
Sergey, I am on Linux... sorry - forgot to mention.
iliyapolak: your link leads me to the performance counter reference... I don't see how that helps.
To clarify: I know how to use Intel PCM and how to collect (any) hardware performance counter. I don't want to use this functionality in my own code. Instead, I start the PCM - run some test - stop PCM - look at collected counter values (usually collected per second).
What I would like to know is, which counter(s) will tell me the CPU utilization... like UOP_RETIRED_ANY??? * TSC.
Thanks for your help :)
I see how I can count the uops/instructions that are actually executed. But to calculate the utilization, I would also need the maximum possible number of instructions that can be executed so that I can output "actual"/"maximum". Comparable to link utilization... sending 6GB/s and having a maximum of 12.8GB/s I get about 50% link utilization (give or take) ;).
I'll think about it for another while...
The term 'cpu utilization' is pretty vague, and there are many measures of it. 'Utilization' implies a resource that has a maximum, where you'd like to know what % of that max is getting used.
For instance, you can look at %idle (or %halted): http://software.intel.com/en-us/articles/measuring-the-halted-state/
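As a rough sketch of the %halted idea: if you already collect a reference-cycle counter that only ticks while the logical CPU is unhalted (an event such as CPU_CLK_UNHALTED.REF) together with the TSC, the halted share of an interval is one minus their ratio. This assumes the reference counter runs at the TSC rate, which you should verify for your processor; the function name and the sample values are made up for illustration.

```python
def halted_percent(ref_cycles_delta: int, tsc_delta: int) -> float:
    """Percent of the sampling interval a logical CPU spent halted.

    ref_cycles_delta: unhalted reference cycles counted in the interval
                      (assumed to tick at the TSC rate while unhalted).
    tsc_delta:        TSC ticks elapsed in the same interval.
    """
    if tsc_delta <= 0:
        raise ValueError("tsc_delta must be positive")
    # Clamp to 1.0 so small read skew between the two counters
    # cannot produce a negative halted percentage.
    unhalted_fraction = min(ref_cycles_delta / tsc_delta, 1.0)
    return 100.0 * (1.0 - unhalted_fraction)


# Hypothetical one-second sample: 0.6e9 unhalted reference cycles
# against 2.4e9 TSC ticks.
print(f"halted: {halted_percent(600_000_000, 2_400_000_000):.1f}%")
```

100 minus this value is the closest counter-based equivalent of a "busy" percentage for that logical CPU.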
Or the average unhalted frequency: http://software.intel.com/en-us/articles/measuring-the-average-unhalted-frequency/
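A minimal sketch of the average unhalted frequency calculation, assuming you have per-interval deltas of unhalted core cycles (an event such as CPU_CLK_UNHALTED.THREAD) and unhalted reference cycles (CPU_CLK_UNHALTED.REF), and that you know the TSC frequency of your part. The function name and the numbers are illustrative only.

```python
def average_unhalted_ghz(core_cycles_delta: int,
                         ref_cycles_delta: int,
                         tsc_ghz: float) -> float:
    """Average frequency (GHz) while the logical CPU was not halted.

    The ratio core/ref says how fast the core clock ran relative to the
    reference clock; scaling by the TSC frequency turns it into GHz.
    Assumes the reference counter ticks at the TSC rate.
    """
    if ref_cycles_delta <= 0:
        raise ValueError("ref_cycles_delta must be positive (CPU never unhalted?)")
    return tsc_ghz * core_cycles_delta / ref_cycles_delta


# Hypothetical sample on a 2.4 GHz part that ran in turbo while active.
print(f"{average_unhalted_ghz(3_000_000_000, 2_400_000_000, 2.4):.2f} GHz")
```

A result above the nominal frequency suggests turbo, and a result below it suggests the core was throttled or running in a lower P-state while active.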
Or the IPC (instructions per clocktick) where the max is 4 or 5 instructions per clocktick.
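To get an "actual"/"maximum" style percentage out of IPC, something like the sketch below would do. It assumes deltas of retired instructions (an event such as INST_RETIRED.ANY) and unhalted core cycles over the same interval. The max_ipc default of 4 is the figure mentioned above, not something the code can determine, so adjust it for your micro-architecture.

```python
def ipc_utilization(instructions_delta: int,
                    core_cycles_delta: int,
                    max_ipc: float = 4.0) -> tuple[float, float]:
    """Return (ipc, percent_of_max) for one sampling interval.

    instructions_delta: instructions retired in the interval.
    core_cycles_delta:  unhalted core cycles in the interval.
    max_ipc:            assumed peak instructions per cycle (4 or 5).
    """
    if core_cycles_delta <= 0:
        raise ValueError("core_cycles_delta must be positive")
    ipc = instructions_delta / core_cycles_delta
    return ipc, 100.0 * ipc / max_ipc


# Hypothetical sample: 2e9 instructions retired in 1e9 unhalted cycles.
ipc, pct = ipc_utilization(2_000_000_000, 1_000_000_000)
print(f"IPC = {ipc:.2f}, {pct:.0f}% of the assumed maximum")
```

Because the denominator is unhalted cycles, this only describes how well the core was used while it was running; combine it with the halted percentage if idle time matters too.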
But in my experience, I've found that trying to dig deeper into the micro-architecture statistics should only be undertaken when you are sure that your application is cpu-bound, not waiting on cache misses, disk IO, network IO, etc. You can see the 'top down' methodology here: http://software.intel.com/en-us/blogs/2011/05/04/top-down-methodology-for-software-performance-analy...
If the IPC is low or your app isn't near 100% of the cpu, you are probably not bottlenecked in the cpu. 'Time' is the critical factor. I usually look to see where in the code the time is being spent with something like VTune.
If your code is bottlenecked by the cpu, you can use this (http://software.intel.com/sites/products/documentation/hpc/amplifierxe/en-us/2011Update/lin/ug_docs/... ) analysis methodology to understand better where the bottleneck is.
>>>Comparable to link utilization... sending 6GB/s and having a maximum of 12.8GB/s I get about 50% link utilization (give or take) ;).>>>
If you are also interested in measuring bus utilization, follow this link: http://software.intel.com/en-us/forums/topic/281625
Sorry, I didn't get to this earlier. Thanks to both of you for the pointers to the material - quite some interesting reading!
iliyapolak: Thanks for the Hyperthreading document - that made a few things clear, although I still can't figure out what the OS is showing me. What is Windows showing in the Task Manager performance view as CPU utilization?
Pat: Thanks for the pointers - all the methods you pointed out seem to work in my case. To clarify, I am not hunting the last bit of performance in my application. My applications are synthetic and sometimes CPU-bound, sometimes memory controller bound and sometimes bound by the QPI link bandwidth - all on purpose. What I am interested in right now is rather a relative measure between different cores/sockets - say cores 4-8 are significantly busier than cores 9-12. I guess that I can get this information with the methods that you pointed me to. And ultimately, I will be careful when I interpret any of the results.
Last question to anybody: what do OS tools (Windows Task Manager, Linux nmon) show me as CPU utilization? There I often see 100% usage, but I doubt that the application is finishing 4-5 instructions per cycle (which would be the ultimate maximum, if I understood anything correctly ;)).
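For the Linux side, a sketch that may help with comparing: tools of this kind generally work from the per-CPU tick counters in /proc/stat, so the percentage is the share of time the CPU was not idle, not a share of peak IPC. That would be consistent with seeing 100% on a core that retires far fewer than 4 instructions per cycle. The function names below are made up, and counting iowait as idle is one common convention, not necessarily what every tool does.

```python
import time


def parse_proc_stat(text: str) -> dict[str, tuple[int, int]]:
    """Return {cpu_name: (busy_ticks, total_ticks)} for each 'cpuN' line."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip non-CPU lines and the aggregate 'cpu' line.
        if not fields or not fields[0].startswith("cpu") or fields[0] == "cpu":
            continue
        # user nice system idle iowait irq softirq steal
        # (guest time is already included in user, so stop at 8 fields).
        ticks = [int(v) for v in fields[1:9]]
        total = sum(ticks)
        idle = ticks[3] + (ticks[4] if len(ticks) > 4 else 0)
        result[fields[0]] = (total - idle, total)
    return result


def utilization_percent(before: dict, after: dict) -> dict[str, float]:
    """Per-CPU busy percentage between two parse_proc_stat() snapshots."""
    out = {}
    for cpu, (busy1, total1) in after.items():
        if cpu not in before:
            continue
        busy0, total0 = before[cpu]
        elapsed = total1 - total0
        if elapsed > 0:
            out[cpu] = 100.0 * (busy1 - busy0) / elapsed
    return out


def sample(interval: float = 1.0) -> dict[str, float]:
    """Take two /proc/stat snapshots 'interval' seconds apart (Linux only)."""
    with open("/proc/stat") as f:
        before = parse_proc_stat(f.read())
    time.sleep(interval)
    with open("/proc/stat") as f:
        after = parse_proc_stat(f.read())
    return utilization_percent(before, after)
```

Calling sample() on the test machine returns one percentage per logical CPU, which can be put next to the counter-based numbers for the same interval to see how the two views differ.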