We are investigating performance monitoring on Intel processors. We noticed that the two Intel processors we are currently testing differ in their monitoring capabilities.
Which Intel processors currently have the best support for performance monitoring (e.g., the largest number of hardware performance counters)?
Thanks in advance and best,
Thank you for posting on the Intel® communities.
I would like to confirm some information, can you please provide the following?
Intel Customer Support Technician
Thanks a lot for your reply.
Yes, we are aware of Intel PCM, and we have actually tested it.
The purpose of our investigation is to profile and tune applications on Intel processors. For that we need processors that allow efficient monitoring/profiling: the overhead should be low, and a sufficiently large number of events should be monitorable at once. We understand that this depends to a large extent on the number of hardware performance counters. So a directly related question is: which Intel processor has the best PMU (i.e., the most performance counters)?
Currently we are using two mainstream processors (we are open to any other type as long as the PMU support is good):
We would highly appreciate it if you could point us to some good (or even the best) Intel processors in terms of profiling capabilities.
Thanks a lot in advance,
it should be low overhead and it should support optimal amount of events to be monitored. We understand that this boils down to a large extent to the number of hardware performance counters. So a directly related question is that which intel processor has the best PMU (number of performance counters)?
You can only achieve low-overhead access for the fixed counters, which are readable from user space via the RDPMC machine instruction.
For low-overhead access to the programmable counters you may use the libpfc library. I presume (I have not measured it yet) that the overhead will be at least hundreds of cycles, if not more, for each PMC access. You would not get multiplexing or thread-following on context switches, and you need to set the affinity of the measured thread.
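To make the RDPMC interface mentioned above a bit more concrete, here is a minimal sketch of how the counter selector passed in ECX is built: bit 30 selects the fixed-function bank, and the low bits give the counter index. The helper name `rdpmc_selector` is made up for illustration. Actually executing RDPMC from user space also requires the OS to have enabled it (CR4.PCE), which this sketch does not cover.

```python
# Sketch only: computes the ECX selector for RDPMC, it does not execute RDPMC.
FIXED_COUNTER_FLAG = 1 << 30  # ECX bit 30 selects the fixed-function counters

# Architectural fixed-function counters and their indices.
FIXED_COUNTERS = {
    "INST_RETIRED.ANY": 0,
    "CPU_CLK_UNHALTED.THREAD": 1,
    "CPU_CLK_UNHALTED.REF_TSC": 2,
}


def rdpmc_selector(index: int, fixed: bool = False) -> int:
    """Return the ECX value for RDPMC (hypothetical helper, not a library API)."""
    if index < 0:
        raise ValueError("counter index must be non-negative")
    return (FIXED_COUNTER_FLAG | index) if fixed else index


if __name__ == "__main__":
    for name, idx in FIXED_COUNTERS.items():
        print(f"{name}: ECX = {rdpmc_selector(idx, fixed=True):#x}")
    print(f"programmable counter 2: ECX = {rdpmc_selector(2):#x}")
```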
There are a lot of performance events (up to 400-500 on server SKUs) and very few performance counters (usually 8-10). So you may use VTune or the perf profiler and program it to measure either 4 or 8 events per session. This way you eliminate the counter-multiplexing overhead (at least hundreds of cycles), but you will need to run more sessions.
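The "4 or 8 events per session" approach amounts to splitting the event list into groups no larger than the number of available counters, then running one profiling session per group. A minimal sketch follows. The function name `plan_sessions` and the placeholder event names are assumptions for illustration, and it ignores per-event counter restrictions that a real tool would have to honor.

```python
from typing import Iterable, List


def plan_sessions(events: Iterable[str], counters_per_session: int) -> List[List[str]]:
    """Split events into groups that each fit in the available counters.

    Hypothetical helper: one group corresponds to one profiling session,
    so no counter multiplexing is needed within a session.
    """
    if counters_per_session < 1:
        raise ValueError("need at least one counter per session")
    ordered = list(dict.fromkeys(events))  # drop duplicates, keep first-seen order
    return [
        ordered[i:i + counters_per_session]
        for i in range(0, len(ordered), counters_per_session)
    ]


if __name__ == "__main__":
    # Placeholder names, not real event names.
    wanted = [f"EVENT_{i}" for i in range(10)]
    for n, group in enumerate(plan_sessions(wanted, 4), start=1):
        print(f"session {n}: {', '.join(group)}")
```

With 10 events and 4 counters this yields 3 sessions, which is the trade-off described above: no multiplexing, but more runs.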
It is hard to say which CPU is the "best", as everything must be analyzed and some counters will undercount or overcount (you may ask Dr. Bandwidth).
Thank you for your clarification.
Based on what you are reporting, we are moving your thread for better assistance to:
Software Development Topics > Software Tuning, Performance Optimization & Platform Monitoring
Please kindly wait for a response.
Intel Customer Support Technician
I would like to inform you that, after checking internally with higher support levels and the engineering department, they advised that they don't think there are “better” or “more” PMUs. Availability varies from part to part because each processor segment has different numbers. For example, there’s a PMU for each core, so CPUs with different numbers of cores will have different numbers of PMUs. Additionally, newer processors have more advanced feature sets.
Documentation on Intel® Performance Counter Monitor (Intel® PCM) which is discontinued:
The open PCM fork Intel contributes to now:
Intel Customer Support Technician
Links to third-party sites and references to third-party trademarks are provided for convenience and illustrative purposes only. Unless explicitly stated, Intel® is not responsible for the contents of such links, and no third-party endorsement of Intel or any of its products is implied.
The "mainstream" Intel processors for the last decade (i.e., excluding Atom, KNC/KNL) all have very similar performance monitoring infrastructure -- especially when considering only the core performance counters. In the standard configuration(s), each core has 3 fixed-function counters and either 4 (with HyperThreading enabled) or 8 (with HyperThreading disabled) programmable performance counters.
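The counter counts above can be checked on a given machine from CPUID leaf 0AH (architectural performance monitoring), which reports the number and width of the programmable counters per logical processor in EAX and of the fixed-function counters in EDX. The sketch below only decodes register values you have already obtained (for example from a `cpuid` utility). The names `PmuInfo` and `decode_cpuid_0a` are made up, and the register values in the example are illustrative rather than read from hardware.

```python
from typing import NamedTuple


class PmuInfo(NamedTuple):
    version: int         # architectural perfmon version ID
    gp_counters: int     # programmable counters per logical processor
    gp_width: int        # bit width of programmable counters
    fixed_counters: int  # number of fixed-function counters
    fixed_width: int     # bit width of fixed-function counters


def decode_cpuid_0a(eax: int, edx: int) -> PmuInfo:
    """Decode EAX/EDX of CPUID leaf 0AH (hypothetical helper name)."""
    return PmuInfo(
        version=eax & 0xFF,              # EAX[7:0]
        gp_counters=(eax >> 8) & 0xFF,   # EAX[15:8]
        gp_width=(eax >> 16) & 0xFF,     # EAX[23:16]
        fixed_counters=edx & 0x1F,       # EDX[4:0]
        fixed_width=(edx >> 5) & 0xFF,   # EDX[12:5]
    )


if __name__ == "__main__":
    # Illustrative values only (not read from a real CPU here).
    print(decode_cpuid_0a(eax=0x07300404, edx=0x00000603))
```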
The biggest differences are between processors with the "client" uncore and those with the "server" uncore. The way that this distinction has been reflected in product naming has changed over the years, but it is usually correct to say that processors supporting 2 DRAM channels have the "client" uncore, while processors supporting more than 2 DRAM channels have the "server" uncore. This means that there are some crossovers -- e.g., the "Core i9" processors have the "server" uncore, while the "Xeon E3" processors typically have the "client" uncore. The best way to distinguish is using the DISPLAYFAMILY_DISPLAYMODEL convention and looking up values in the table at the beginning of Volume 4 of the Intel Architectures SW Developer's Manual (document 335592).
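For reference, the DISPLAYFAMILY_DISPLAYMODEL value is derived from the processor signature returned in EAX by CPUID leaf 01H: the extended family is added only when the base family is 0FH, and the extended model is prepended only when the base family is 06H or 0FH. A minimal sketch, where the function name `display_family_model` is made up and the signatures in the example are just sample inputs:

```python
def display_family_model(eax: int) -> str:
    """Format CPUID.01H:EAX as DisplayFamily_DisplayModel, e.g. '06_55H'.

    Hypothetical helper; the bit layout follows the CPUID signature format.
    """
    family = (eax >> 8) & 0xF
    model = (eax >> 4) & 0xF
    ext_model = (eax >> 16) & 0xF
    ext_family = (eax >> 20) & 0xFF

    # Extended family is only added when the base family is 0FH.
    display_family = family + ext_family if family == 0xF else family
    # Extended model is only used when the base family is 06H or 0FH.
    if family in (0x6, 0xF):
        display_model = (ext_model << 4) | model
    else:
        display_model = model
    return f"{display_family:02X}_{display_model:02X}H"


if __name__ == "__main__":
    for signature in (0x50654, 0x906EA):  # sample signature values
        print(f"{signature:#x} -> {display_family_model(signature)}")
```

The resulting string is what you look up in the Volume 4 table.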
The type of "uncore" controls what can be measured using the "OFFCORE_RESPONSE" event in the core performance counters, as described in Chapter 18 of Volume 3 of the Intel Architectures SW Developer's Manual (document 325384). Careful review of the tables of performance counter events in Chapter 19 of that same manual will point to a few other small differences in the core performance counter events -- typically also related to the uncore. The counter information from Chapter 19 is also available in tables at https://download.01.org/perfmon/, which are often more comprehensive and which are certainly easier to search and to post-process in software.
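As a small illustration of post-processing those tables in software, here is a sketch that searches already-parsed event records by name. It assumes each record is a JSON object with `EventName`, `EventCode` and `UMask` string fields. Check this against the file you actually download, since the layout may differ between versions. The `SAMPLE` records are a hand-written stand-in with two architectural events, not a copy of any published file, and `find_events` is a made-up helper name.

```python
import json
from typing import Dict, List

# Hand-written stand-in for a downloaded event file (assumed layout).
SAMPLE: List[Dict[str, str]] = json.loads("""
[
  {"EventName": "CPU_CLK_UNHALTED.THREAD_P", "EventCode": "0x3C", "UMask": "0x00"},
  {"EventName": "INST_RETIRED.ANY_P",        "EventCode": "0xC0", "UMask": "0x00"}
]
""")


def find_events(records: List[Dict[str, str]], needle: str) -> List[Dict[str, str]]:
    """Return records whose EventName contains needle (case-insensitive)."""
    needle = needle.lower()
    return [r for r in records if needle in r.get("EventName", "").lower()]


if __name__ == "__main__":
    for rec in find_events(SAMPLE, "clk"):
        print(rec["EventName"], rec["EventCode"], rec["UMask"])
```

For a real file you would replace `SAMPLE` with the result of `json.load()` on the downloaded table, adjusting for any wrapper object around the event list.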
The "fast RDPMC" option (reading only the low-order 32 bits of a counter) is many generations obsolete and not relevant to recent processors. There are differences in the overhead of the RDPMC instruction across generations -- my testing has found that all of the Skylake and newer cores have very fast RDPMC -- something like 20-24 core cycles, if I recall correctly.
Although there are always a few unexpected bugs and a few cases in which interesting counters are dropped when going from one generation to the next, in general the best performance counter support is in the newest generation of "mainstream" processors. These have better baseline functionality (especially for PEBS/PDIR/LoadLatencyMonitoring) and typically have more bugs fixed than new bugs added in the basic performance counter event counting functionality.
Counters in the "uncore" are a much larger topic, but processors with the "server" uncore typically have dramatically more counter functionality in the uncore than processors with the "client" uncore. The performance counters in the "server" uncore are also moderately well documented in each generation, while performance counters in the "client" uncore are less consistently documented.